Abstract. Grouping structures arise naturally in many statistical modeling
problems. Several methods have been proposed for variable selection
that respect grouping structure in variables. Examples include
the group LASSO and several concave group selection methods. In this 
article, we give a selective review of group selection concerning method- 
ological developments, theoretical properties and computational algo- 
rithms. We pay particular attention to group selection methods involv- 
ing concave penalties. We address both group selection and bi-level se- 
lection methods. We describe several applications of these methods in 
nonparametric additive models, semiparametric regression, seemingly 
unrelated regressions, genomic data analysis and genome wide associ- 
ation studies. We also highlight some issues that require further study. 

Key words and phrases: Bi-level selection, group LASSO, concave 
group selection, penalized regression, sparsity, oracle property. 



1. INTRODUCTION 

Consider a linear regression model with p predic- 
tors. Suppose the predictors can be naturally di- 
vided into J nonoverlapping groups, and the model 
is written as 

(1.1)    $$ y = \sum_{j=1}^{J} X_j \beta_j + \varepsilon, $$

where $y$ is an $n \times 1$ vector of response variables, $X_j$ is
the $n \times d_j$ design matrix of the $d_j$ predictors in the
$j$th group, $\beta_j = (\beta_{j1}, \ldots, \beta_{jd_j})' \in \mathbb{R}^{d_j}$ is the $d_j \times 1$
vector of regression coefficients of the $j$th group and
$\varepsilon$ is the error vector. Without loss of generality, we
take both the predictors and response to be centered
around the mean. It is desirable to treat each
group of variables as a unit and take advantage of 
the grouping structure present in these models when 
estimating regression coefficients and selecting im- 
portant variables. 

Many authors have considered the problem of group 
selection in various statistical modeling problems. 
Bakin (1999) proposed the group LASSO and a com- 
putational algorithm. This method and related group 
selection methods and algorithms were further de- 
veloped by Yuan and Lin (2006). The group LASSO 
uses an $\ell_2$ norm of the coefficients associated with a
group of variables in the penalty function and is a 
natural extension of the LASSO (Tibshirani, 1996). 
Antoniadis and Fan (2001) studied a class of block- 
wise shrinkage approaches for regularized wavelet es- 
timation in nonparametric regression problems. They 
discussed several ways to shrink wavelet coefficients 
in their natural blocks, which include the blockwise 
hard- and soft-threshold rules. Meier, van de Geer 
and Bühlmann (2008) studied the group LASSO for
logistic regression. Zhao, Rocha and Yu (2009) proposed
a quite general composite absolute penalty for
group selection, which includes the group LASSO as
a special case. Huang, Ma, Xie and Zhang (2009) 
considered the problem of simultaneous group and 
individual variable selection, or bi-level selection, 
and proposed a group bridge method. Breheny and 
Huang (2009) proposed a general framework for bi- 
level selection in generalized linear models and de- 
rived a local coordinate descent algorithm. 

Grouping structures can arise for many reasons, 
and give rise to quite different modeling goals. Com- 
mon examples include the representation of multi- 
level categorical covariates in a regression model by 
a group of indicator variables, and the representa- 
tion of the effect of a continuous variable by a set 
of basis functions. Grouping can also be introduced 
into a model in the hopes of taking advantage of 
prior knowledge that is scientifically meaningful. For 
example, in gene expression analysis, genes belong- 
ing to the same biological pathway can be consid- 
ered a group. In genetic association studies, genetic 
markers from the same gene can be considered a 
group. It is desirable to take into account the group- 
ing structure in the analysis of such data. 

Depending on the situation, the individual vari- 
ables in the groups may or may not be meaningful 
scientifically. If they are not, we are typically not 
interested in selecting individual variables; our in- 
terest is entirely in group selection. However, if in- 
dividual variables are meaningful, then we are usu- 
ally interested in selecting important variables as 
well as important groups; we refer to this as bi- 
level selection. For example, if we represent a con- 
tinuous factor by a set of basis functions, the indi- 
vidual variables are an artificial construct, and se- 
lecting the important members of the group is typ- 
ically not of interest. In the gene expression and 
genetic marker examples, however, selection of in- 
dividual genes/markers is just as important as se- 
lecting important groups. In other examples, such 
as a group of indicator functions for a categorical 
variable, whether we are interested in selecting indi- 
vidual members depends on the context of the study. 

We address both group selection and bi-level se- 
lection in this review. The distinction between these 
two goals is crucial for several reasons. Not only are 
different statistical methods used for each type of 
problem, but as we will see, the predictors in a group 
can be made orthonormal in settings where bi-level 
selection is not a concern. This has a number of 
ramifications for deriving theoretical results and de- 
veloping algorithms to fit these models. 



We give a selective review of group selection con- 
cerning methodological developments, theoretical 
properties and computational algorithms. We de- 
scribe several important applications of group selec- 
tion and bi-level selection in nonparametric additive 
models, semiparametric regression, seemingly unre- 
lated regressions, genomic data analysis and genome 
wide association studies. We also highlight some is- 
sues that require further study. For the purposes 
of simplicity, we focus on penalized versions of least 
squares regression in this review. Many authors have 
extended these models to other loss functions, in 
particular those of generalized linear models. We at- 
tempt to point out these efforts when relevant. 

2. GROUP SELECTION METHODS 

2.1 Group LASSO 

For a column vector $v \in \mathbb{R}^d$ with $d \ge 1$ and a
positive definite matrix $R$, denote $\|v\|_2 = (v'v)^{1/2}$
and $\|v\|_R = (v'Rv)^{1/2}$. Let $\beta = (\beta_1', \ldots, \beta_J')'$, where
$\beta_j \in \mathbb{R}^{d_j}$. The group LASSO solution $\hat\beta(\lambda)$ is defined
as a minimizer of

(2.1)    $$ \frac{1}{2n}\Big\| y - \sum_{j=1}^{J} X_j\beta_j \Big\|_2^2 + \lambda\sum_{j=1}^{J} c_j\|\beta_j\|_{R_j}, $$

where $\lambda \ge 0$ is the penalty parameter and the $R_j$'s are
$d_j \times d_j$ positive definite matrices. Here the $c_j$'s in the
penalty are used to adjust for the group sizes. A reasonable
choice is $c_j = \sqrt{d_j}$. Because (2.1) is convex,
any local minimizer of (2.1) is also a global min- 
imizer and is characterized by the Karush-Kuhn- 
Tucker conditions as given in Yuan and Lin (2006). 
It is possible, however, for multiple solutions to ex- 
ist, as (2.1) may not be strictly convex in situations 
where the ordinary least squares estimator is not 
uniquely defined. 

An important question in the definition of group 
LASSO is the choice of $R_j$. For orthonormal $X_j$ with
$X_j'X_j/n = I_{d_j}$, $j = 1, \ldots, J$, Yuan and Lin (2006)
suggested taking $R_j = I_{d_j}$. However, using $R_j = I_{d_j}$
may not be appropriate, since the scales of the predictors
may not be the same. In general, a reasonable
choice of $R_j$ is to take the Gram matrix based
on $X_j$, that is, $R_j = X_j'X_j/n$, so that the penalty is
proportional to $\|X_j\beta_j\|_2$. This is equivalent to performing
standardization at the group level, which
can be seen as follows. Write $R_j = U_j'U_j$ for a $d_j \times d_j$
upper triangular matrix $U_j$ via Cholesky decomposition.
Let $\tilde X_j = X_jU_j^{-1}$ and $b_j = U_j\beta_j$. Criterion
(2.1) becomes

(2.2)    $$ \frac{1}{2n}\Big\| y - \sum_{j=1}^{J} \tilde X_j b_j \Big\|_2^2 + \lambda\sum_{j=1}^{J} c_j\|b_j\|_2. $$

The solution to the original problem (2.1) can be obtained
by using the transformation $\beta_j = U_j^{-1}b_j$. By
the definition of $U_j$, we have $n^{-1}\tilde X_j'\tilde X_j = I_{d_j}$. Therefore,
by using this choice of $R_j$, without loss of generality,
we can assume that $X_j$ satisfies $n^{-1}X_j'X_j = I_{d_j}$,
$1 \le j \le J$. Note that we do not assume $X_j$ and
$X_k$, $j \ne k$, are orthogonal.

The above choice of $R_j$ is easily justified in the
special case where $d_j = 1$, $1 \le j \le J$. In this case, the
group LASSO simplifies to the standard LASSO and
$R_j = \|X_j\|_2^2/n$ is proportional to the sample variance
of the $j$th predictor. Thus, taking $R_j$ to be
the Gram matrix is the same as standardizing the
predictors before the analysis, which is often recommended
when applying LASSO for variable selection.
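
The group-level standardization described above can be sketched numerically. The following is a minimal illustration, assuming NumPy; the function name `orthonormalize_group` and the simulated data are hypothetical and not part of the original article. It checks that the transformed design has identity Gram matrix and that coefficients map back through $\beta_j = U_j^{-1}b_j$.

```python
import numpy as np

def orthonormalize_group(X):
    """Return (X_tilde, U) with X_tilde' X_tilde / n = I and X_tilde = X U^{-1}.

    U is the upper triangular Cholesky factor of the Gram matrix R = X'X/n,
    so that R = U'U. X must have full column rank.
    """
    n = X.shape[0]
    R = X.T @ X / n                        # Gram matrix R_j
    U = np.linalg.cholesky(R).T            # NumPy returns the lower factor; transpose it
    X_tilde = np.linalg.solve(U.T, X.T).T  # solves X_tilde U = X without forming U^{-1}
    return X_tilde, U

# Hypothetical data: one group with 3 correlated, centered predictors.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 2.0]])
X = rng.standard_normal((50, 3)) @ mix
X -= X.mean(axis=0)
X_tilde, U = orthonormalize_group(X)

beta = np.array([1.0, -2.0, 0.5])          # coefficients on the original scale
b = U @ beta                               # coefficients on the standardized scale
beta_back = np.linalg.solve(U, b)          # back-transformation beta = U^{-1} b
```

With this construction, $\|b\|_2$ equals $\|\beta\|_{R}$, so an $\ell_2$ penalty on the standardized coefficients is the same as the $R_j$-norm penalty on the original ones.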

Several authors have studied the theoretical prop- 
erties of the group LASSO, building on the ideas 
and approaches for studying the behavior of the 
LASSO, on which there is an extensive literature; 
see Bühlmann and van de Geer (2011) and the references
therein. Bach (2008) showed that the group
LASSO is group selection consistent in a random design
model for fixed p under a variant of the irrepresentable
condition (Meinshausen and Bühlmann,
2006; Zhao and Yu, 2006; Zou, 2006). Nardi and Rinaldo
(2008) considered selection consistency of the
group LASSO under an irrepresentable condition
and the bounds on the prediction and estimation errors
under a restricted eigenvalue condition (Bickel,
Ritov and Tsybakov, 2009; Koltchinskii, 2009), assuming
that the Gram matrices $X_j'X_j/n$ are proportional
to the identity matrix. Wei and Huang (2010)
considered the sparsity and $\ell_2$ bounds on the estimation
and prediction errors of the group LASSO under
the sparse Riesz condition (Zhang and Huang,
2008). They also studied the selection property of 
the adaptive group LASSO using the group LASSO 
as the initial estimate. The adaptive group LASSO 
can be formulated in a way similar to the stan- 
dard adaptive LASSO (Zou, 2006). Recently, there 
has been considerable progress in the studies of the 
LASSO based on sharper versions of the restricted 
eigenvalue condition (van de Geer and Bühlmann,
2009; Zhang, 2009; Ye and Zhang, 2010). It would 
be interesting to extend these results to the group 
LASSO. 



A natural question about the group LASSO is un- 
der what conditions it will perform better than the 
standard LASSO. This question was addressed by 
Huang and Zhang (2010), who introduced the con- 
cept of strong group sparsity. They showed that the 
group LASSO is superior to the standard LASSO 
under the strong group sparsity and certain other 
conditions, including a group sparse eigenvalue con- 
dition. More recently, Lounici et al. (2011) conducted 
a detailed analysis of the group LASSO. They es- 
tablished oracle inequalities for the prediction and 
$\ell_2$ estimation errors of group LASSO under a restricted
eigenvalue condition on the design matrix.
They also showed that the rate of convergence of
their upper bounds is optimal in a minimax sense,
up to a logarithmic factor, for all estimators over a
class of group sparse vectors. Furthermore, by deriving
lower bounds for the prediction and $\ell_2$ estimation
errors of the standard LASSO, they demonstrated
that the group LASSO can have smaller prediction
and estimation errors than the LASSO.

While the group LASSO enjoys excellent properties
in terms of prediction and $\ell_2$ estimation errors,
its selection consistency hinges on the assumption
that the design matrix satisfies the irrepresentable
condition. This condition is, in general, difficult to
satisfy, especially in $p \gg n$ models (Zhang, 2010a).
Fan and Li (2001) pointed out that the standard
LASSO over-shrinks large coefficients due to the nature
of the $\ell_1$ penalty. As a result, the LASSO tends
to recruit unimportant variables into the model in 
order to compensate for its overshrinkage of large 
coefficients, and consequently, it may not be able to 
distinguish variables with small to moderate coeffi- 
cients from unimportant ones. This can lead to rel- 
atively high false positive selection rates. Leng, Lin 
and Wahba (2006) showed that the LASSO does not 
achieve selection consistency if the penalty param- 
eter is selected by minimizing the prediction error. 
The group LASSO is likely to behave similarly. In 
particular, the group LASSO may also tend to select 
a model that is larger than the underlying model 
with relatively high false positive group selection 
rate. Further work is needed to better understand 
the properties of the group LASSO in terms of false 
positive and false negative selection rates. 

2.2 Concave 2-Norm Group Selection 

The group LASSO can be constructed by applying 
the $\ell_1$ penalty to the norms of the groups. Specifically,
for $\rho(t;\lambda) = \lambda|t|$, the group LASSO penalty
can be written as $\lambda c_j\|\beta_j\|_{R_j} = \rho(\|\beta_j\|_{R_j}; c_j\lambda)$. Other
penalty functions could be used instead. Thus a more 
general class of group selection methods can be based 
on the criterion 



(2.3)    $$ \frac{1}{2n}\Big\| y - \sum_{j=1}^{J} X_j\beta_j \Big\|_2^2 + \sum_{j=1}^{J} \rho(\|\beta_j\|_{R_j}; c_j\lambda, \gamma), $$

where $\rho(t; c_j\lambda, \gamma)$ is concave in $t$. Here $\gamma$ is an additional
tuning parameter that may be used to modify
$\rho$. As in the definition of the group LASSO, we
assume without loss of generality that each $X_j$ is orthonormal
with $X_j'X_j/n = I_{d_j}$ and $\|\beta_j\|_{R_j} = \|\beta_j\|_2$.
It is reasonable to use penalty functions that work
well for individual variable selection. Some possible
choices include: (a) the bridge penalty with
$\rho(x;\lambda,\gamma) = \lambda|x|^{\gamma}$, $0 < \gamma \le 1$ (Frank and Friedman, 1993);
(b) the SCAD penalty with
$\rho(x;\lambda,\gamma) = \lambda\int_0^{|x|} \min\{1, (\gamma - t/\lambda)_+/(\gamma - 1)\}\,dt$, $\gamma > 2$ (Fan and Li, 2001; Fan
and Peng, 2004), where for any $a \in \mathbb{R}$, $a_+$ denotes its
positive part, that is, $a_+ = a 1_{\{a \ge 0\}}$; (c) the minimax
concave penalty (MCP) with
$\rho(x;\lambda,\gamma) = \lambda\int_0^{|x|} (1 - t/(\gamma\lambda))_+\,dt$, $\gamma > 1$ (Zhang, 2010a). All these penalties
have the oracle property for individual variables,
meaning that, with high probability under appropriate
conditions, the corresponding penalized estimators
are equal to the least squares estimator computed
as if the true model were known. See Huang, Horowitz and
Ma (2008) for the bridge penalty, Fan and Li (2001)
and Fan and Peng (2004) for the SCAD penalty
and Zhang (2010a) for the MCP. By applying
these penalties to (2.3), we obtain the 2-norm group
bridge, 2-norm group SCAD and 2-norm group MCP,
respectively. Another interesting concave penalty is
the capped-$\ell_1$ penalty $\rho(t;\lambda,\gamma) = \min(\gamma\lambda^2/2, \lambda|t|)$
with $\gamma > 1$ (Zhang, 2010b; Shen, Zhu and Pan, 2011).
However, this penalty has not been applied to the 
group selection problems. 

For $c_j = \sqrt{d_j}$, the group MCP and capped-$\ell_1$ penalty
satisfy the invariance property

(2.4)    $$ \rho(\|\beta_j\|_2; \sqrt{d_j}\lambda, \gamma) = \rho(\sqrt{d_j}\|\beta_j\|_2; \lambda, d_j\gamma). $$

Thus the rescaling of $\lambda$ can also be interpreted based
on the expression on the right-hand side of (2.4).
The multiplier $\sqrt{d_j}$ of $\|\beta_j\|_2$ standardizes the group
size. This ensures that smaller groups will not be
overwhelmed by larger groups. The multiplier $d_j$ for
$\gamma$ makes the amount of regularization per group proportional
to its size. Thus the interpretation of $\gamma$
remains the same as that in the case where group
sizes are equal to one. Because the MCP is equivalent
to the $\ell_1$ penalty when $\gamma = \infty$, the $\ell_1$ penalty
also satisfies (2.4). However, many other penalties,
including the SCAD and $\ell_q$ penalties with $q \ne 1$, do
not satisfy (2.4).
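
The invariance property (2.4) can be checked numerically. The sketch below, assuming NumPy, evaluates the MCP, SCAD and capped-$\ell_1$ penalties in closed form (obtained by carrying out the integrals in their definitions) and compares the two sides of (2.4) on a grid. The function names, the grid and the parameter values $d_j = 4$, $\lambda = 1$, $\gamma = 3.7$ are illustrative choices, not taken from the article.

```python
import numpy as np

def mcp(x, lam, gam):
    """MCP: lam*|x| - x^2/(2*gam) for |x| <= gam*lam, else gam*lam^2/2 (gam > 1)."""
    ax = np.abs(np.asarray(x, dtype=float))
    return np.where(ax <= gam * lam, lam * ax - ax**2 / (2 * gam), gam * lam**2 / 2)

def scad(x, lam, gam):
    """SCAD in closed form (gam > 2), obtained by integrating its derivative."""
    ax = np.abs(np.asarray(x, dtype=float))
    middle = (2 * gam * lam * ax - ax**2 - lam**2) / (2 * (gam - 1))
    return np.where(ax <= lam, lam * ax,
                    np.where(ax <= gam * lam, middle, lam**2 * (gam + 1) / 2))

def capped_l1(x, lam, gam):
    """Capped-l1 penalty: min(gam*lam^2/2, lam*|x|)."""
    ax = np.abs(np.asarray(x, dtype=float))
    return np.minimum(gam * lam**2 / 2, lam * ax)

def invariance_gap(pen, d, lam, gam, t):
    """Largest absolute difference between the two sides of (2.4) over the grid t."""
    lhs = pen(t, np.sqrt(d) * lam, gam)
    rhs = pen(np.sqrt(d) * t, lam, d * gam)
    return float(np.max(np.abs(lhs - rhs)))

t = np.linspace(0.0, 20.0, 401)            # grid of group norms ||beta_j||_2
gaps = {name: invariance_gap(pen, d=4, lam=1.0, gam=3.7, t=t)
        for name, pen in [("MCP", mcp), ("capped-l1", capped_l1), ("SCAD", scad)]}
```

Under these settings the gap is zero up to rounding for the MCP and capped-$\ell_1$ penalties and visibly nonzero for SCAD, in line with the statement above.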

An interesting question that has not received adequate
attention is how to determine the value of $\gamma$.
In linear regression models with standardized predictors,
Fan and Li (2001) suggested using $\gamma \approx 3.7$
in the SCAD penalty, and Zhang (2010a) suggested
using $\gamma \approx 2.7$ in the MCP. Note, however, that when
$\gamma \to \infty$, the group MCP converges to the group
LASSO, and when $\gamma \to 1$, it converges to the group
hard threshold penalty (Antoniadis, 1996)

$$ \rho(t;\lambda) = \lambda^2 - \tfrac{1}{2}(|t| - \lambda)^2 1_{\{|t| \le \lambda\}}. $$

Clearly, the choice of $\gamma$ has a big impact on the estimate.
See Mazumder, Friedman and Hastie (2011)
and Breheny and Huang (2011) for further discussion
on the choice of $\gamma$.

To illustrate this point in the grouped variable 
setting, we consider a simple example with J = 20 
groups, in which only the first two groups have nonzero
coefficients with $\beta_1 = (-\sqrt{2}, \sqrt{2})'$, $\beta_2 = (0.5, 1, -0.5)'$,
so $\|\beta_1\|_2 = 2$ and $\|\beta_2\|_2 \approx 1.22$. The sizes
of the groups with zero coefficients are 3. The top
panel in Figure 1 shows the paths of the estimated
norms $\|\hat\beta_1\|$ and $\|\hat\beta_2\|$ for $\gamma = 1.2$, 2.5 and $\infty$, where
$\gamma = \infty$ corresponds to the group LASSO. The bottom
panel shows the solution paths of the individual
coefficients. It can be seen that the characteristics
of the solution paths are quite different for different
values of $\gamma$. For the 2-norm group MCP with $\gamma = 1.2$
or 2.5, there is a region in the paths where the estimates
are close to the true parameter values. However,
for the group LASSO ($\gamma = \infty$), the estimates
are always biased toward zero except when $\lambda = 0$.

2.3 Orthogonal Groups 

To have some understanding of the basic charac- 
teristics of the group LASSO and nonconvex group 
selection methods, we consider the special case where 
the groups are orthonormal with $X_j'X_k = 0$, $j \ne k$,
and $X_j'X_j/n = I_{d_j}$. In this case, the problem simplifies
to that of estimation in $J$ single-group models
of the form $y = X_j\theta + \varepsilon$. Let $z = X_j'y/n$ be the
least squares estimator of $\theta$. Without loss of generality,
let $c_j = 1$ below in this section. We have
$n^{-1}\|y - X_j\theta\|_2^2 = \|z - \theta\|_2^2 + n^{-1}\|y\|_2^2 - \|z\|_2^2$, since
$X_j'X_j/n = I_{d_j}$. Thus the penalized least squares criterion
is $2^{-1}\|z - \theta\|_2^2 + \rho(\|\theta\|_2; \lambda, \gamma)$. Denote

(2.5)    $$ S(z;t) = \Big(1 - \frac{t}{\|z\|_2}\Big)_+ z. $$

[Figure 1 image not reproduced; surviving panel title: Group MCP ($\gamma = 2.7$).]

Fig. 1. The solution paths of the 2-norm group MCP for $\gamma = 1.2$, 2.7 and $\infty$, where $\gamma = \infty$ corresponds to the group LASSO.
The top panel shows the paths of the $\ell_2$ norms of $\hat\beta_j$; the bottom shows the paths of the individual coefficients. The solid lines
and dashed lines in the plots indicate the paths of the coefficients in the nonzero groups 1 and 2, respectively. The dotted lines
represent the zero groups.



This expression is used in Yuan and Lin (2006) for 
computing the group LASSO solutions via a group 
coordinate descent algorithm. It is a multivariate 
version of the soft-threshold operator (Donoho and 
Johnstone, 1994) in which the soft-thresholding is 
applied to the length of the vector, while leaving 
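its direction unchanged. By taking $\rho$ to be the $\ell_1$,
MCP and SCAD penalties, it can be verified that
the group LASSO, group MCP and group SCAD
solutions in a single group model have the following
expressions:

• Group LASSO:

(2.6)    $$ \hat\theta_{\mathrm{gLASSO}}(z;\lambda) = S(z;\lambda). $$

• 2-norm group MCP: for $\gamma > 1$,

(2.7)    $$ \hat\theta_{\mathrm{gMCP}}(z;\lambda,\gamma) = \begin{cases} \dfrac{\gamma}{\gamma-1}\,S(z;\lambda), & \text{if } \|z\|_2 \le \gamma\lambda, \\ z, & \text{if } \|z\|_2 > \gamma\lambda. \end{cases} $$

• 2-norm group SCAD: for $\gamma > 2$,

(2.8)    $$ \hat\theta_{\mathrm{gSCAD}}(z;\lambda,\gamma) = \begin{cases} S(z;\lambda), & \text{if } \|z\|_2 \le 2\lambda, \\ \dfrac{\gamma-1}{\gamma-2}\,S\Big(z;\dfrac{\gamma\lambda}{\gamma-1}\Big), & \text{if } 2\lambda < \|z\|_2 \le \gamma\lambda, \\ z, & \text{if } \|z\|_2 > \gamma\lambda. \end{cases} $$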

The group LASSO solution here is simply the mul- 
tivariate soft-threshold operator. For the 2-norm 
group MCP solution, in the region $\|z\|_2 > \gamma\lambda$, it is
equal to the unbiased estimator $z$, and in the remaining
region, it is a scaled-up soft threshold operator.
The 2-norm group SCAD is similar to the 2-norm
group MCP in that it is equal to the unbiased
estimator $z$ in the region $\|z\|_2 > \gamma\lambda$. In the region
$\|z\|_2 \le \gamma\lambda$, the 2-norm group SCAD is also related to
the soft threshold operator, but takes a more complicated
form than the 2-norm group MCP.

For the 2-norm group MCP, $\hat\theta_{\mathrm{gMCP}}(\cdot;\lambda,\gamma) \to \hat\theta_{\mathrm{gLASSO}}(\cdot;\lambda)$
as $\gamma \to \infty$ and $\hat\theta_{\mathrm{gMCP}}(\cdot;\lambda,\gamma) \to H(\cdot;\lambda)$
as $\gamma \to 1$ for any given $\lambda > 0$, where $H(\cdot;\lambda)$ is the
hard-threshold operator defined as

(2.9)    $$ H(z;\lambda) = \begin{cases} 0, & \text{if } \|z\|_2 \le \lambda, \\ z, & \text{if } \|z\|_2 > \lambda. \end{cases} $$

Therefore, for a given $\lambda > 0$, $\{\hat\theta_{\mathrm{gMCP}}(\cdot;\lambda,\gamma) : 1 < \gamma \le \infty\}$
is a family of threshold operators with the multivariate
hard and soft threshold operators at the
extremes $\gamma = 1$ and $\infty$.

For the 2-norm group SCAD, we have $\hat\theta_{\mathrm{gSCAD}}(\cdot;\lambda,\gamma) \to \hat\theta_{\mathrm{gLASSO}}(\cdot;\lambda)$
as $\gamma \to \infty$ and $\hat\theta_{\mathrm{gSCAD}}(\cdot;\lambda,\gamma) \to H^*(\cdot;\lambda)$
as $\gamma \to 2$, where

(2.10)    $$ H^*(z;\lambda) = \begin{cases} S(z;\lambda), & \text{if } \|z\|_2 \le 2\lambda, \\ z, & \text{if } \|z\|_2 > 2\lambda. \end{cases} $$

This is different from the hard threshold operator
(2.9). For a given $\lambda > 0$, $\{\hat\theta_{\mathrm{gSCAD}}(\cdot;\lambda,\gamma) : 2 < \gamma \le \infty\}$
is a family of threshold operators with $H^*$ and
soft threshold operators at the extremes $\gamma = 2$ and
$\infty$. Note that the hard threshold operator is not
included in the group SCAD family.

The closed-form expressions given above illustrate 
some important differences of the three group selec- 
tion methods. They also provide building blocks of 
the group coordinate descent algorithm for comput- 
ing these solutions described below. 
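
The single-group solutions (2.5) to (2.8) translate directly into code. The following is a minimal sketch, assuming NumPy; the function names are ours and chosen only for illustration.

```python
import numpy as np

def soft(z, t):
    """Multivariate soft-threshold S(z; t) = (1 - t/||z||_2)_+ z, as in (2.5)."""
    z = np.asarray(z, dtype=float)
    norm = np.linalg.norm(z)
    if norm <= t:                      # also covers z = 0, avoiding division by zero
        return np.zeros_like(z)
    return (1.0 - t / norm) * z

def group_lasso(z, lam):
    """Group LASSO solution (2.6)."""
    return soft(z, lam)

def group_mcp(z, lam, gam):
    """2-norm group MCP solution (2.7), for gam > 1."""
    z = np.asarray(z, dtype=float)
    if np.linalg.norm(z) <= gam * lam:
        return gam / (gam - 1.0) * soft(z, lam)
    return z.copy()

def group_scad(z, lam, gam):
    """2-norm group SCAD solution (2.8), for gam > 2."""
    z = np.asarray(z, dtype=float)
    norm = np.linalg.norm(z)
    if norm <= 2.0 * lam:
        return soft(z, lam)
    if norm <= gam * lam:
        return (gam - 1.0) / (gam - 2.0) * soft(z, gam * lam / (gam - 1.0))
    return z.copy()

z = np.array([3.0, 4.0])               # ||z||_2 = 5
shrunk = group_lasso(z, 1.0)           # direction kept, length reduced by lambda
unbiased = group_mcp(z, 1.0, 3.0)      # ||z||_2 > gam*lam, so z is returned as is
```

All three rules rescale $z$ without changing its direction; they differ only in how much the length $\|z\|_2$ is reduced.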

2.4 Computation via Group Coordinate Descent 

Group coordinate descent (GCD) is an efficient 
approach for fitting models with grouped penalties. 
The first algorithm of this kind was proposed by 
Yuan and Lin (2006) as a way to compute the so- 
lutions to the group LASSO. Because the solution 
paths of the group LASSO are not piecewise linear, 
they cannot be computed using the LARS algorithm 
(Efron et al., 2004). 

Coordinate descent algorithms (Fu, 1998; Fried- 
man et al., 2007; Wu and Lange, 2008) have be- 
come widely used in the field of penalized regres- 
sion. These algorithms were originally proposed for 
optimization in problems with convex penalties such 
as the LASSO, but have also been used in calculat- 
ing SCAD and MCP estimates (Breheny and Huang, 
2011). We discuss here the idea behind the algorithm 
and its extension to the grouped variable case. 

Coordinate descent algorithms optimize an objec- 
tive function with respect to a single parameter at 
a time, iteratively cycling through the parameters 
until convergence is reached; similarly, group coordinate
descent algorithms optimize the target function
with respect to a single group at a time, and
cycle through the groups until convergence. These



algorithms are particularly suitable for fitting group 
LASSO, group SCAD and group MCP models, since 
all three have simple closed-form expressions for a 
single-group model (2.6)-(2.8). 

A group coordinate descent step consists of par- 
tially optimizing the penalized least squares crite- 
rion (2.1) or (2.3) with respect to the coefficients in 
group j. Define 



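$$ L_j(\beta_j;\lambda,\gamma) = \frac{1}{2n}\Big\| y - \sum_{k \ne j} X_k\tilde\beta_k - X_j\beta_j \Big\|_2^2 + \rho(\|\beta_j\|_2; c_j\lambda, \gamma), $$

where $\tilde\beta$ denotes the most recently updated value
of $\beta$. Denote $\tilde y_j = \sum_{k \ne j} X_k\tilde\beta_k$ and $\tilde z_j = X_j'(y - \tilde y_j)/n$.
Note that $\tilde y_j$ represents the fitted values excluding
the contribution from group $j$, and $\tilde z_j$ is the least squares
coefficient obtained by regressing the corresponding partial
residuals $y - \tilde y_j$ on $X_j$. Just as in ordinary
least squares regression, the value of $\beta_j$ that optimizes
$L_j(\beta_j;\lambda,\gamma)$ depends on the data only through $\tilde z_j$. In other words,
the minimizer of $L_j(\beta_j;\lambda,\gamma)$ is given by $F(\tilde z_j;\lambda,\gamma)$
(with $\lambda$ replaced by $c_j\lambda$ when $c_j \ne 1$),
where $F$ is one of the solutions in (2.6) to (2.8),
depending on the penalty used.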



Let $\tilde\beta^{(0)} = (\tilde\beta_1^{(0)\prime}, \ldots, \tilde\beta_J^{(0)\prime})'$ be the initial value,
and let $s$ denote the iteration. The GCD algorithm
consists of the following steps:

Step 1. Set $s = 0$. Initialize vector of residuals $r = y - \tilde y$, where $\tilde y = \sum_{j=1}^{J} X_j\tilde\beta_j^{(0)}$.

Step 2. For $j = 1, \ldots, J$, carry out the following calculations:

(a) calculate $\tilde z_j = n^{-1}X_j'r + \tilde\beta_j^{(s)}$;

(b) update $\tilde\beta_j^{(s+1)} = F(\tilde z_j;\lambda,\gamma)$;

(c) update $r \leftarrow r - X_j(\tilde\beta_j^{(s+1)} - \tilde\beta_j^{(s)})$.

Step 3. Update $s \leftarrow s + 1$.

Step 4. Repeat steps 2-3 until convergence.

The update in Step 2(c) ensures that r always 
holds the current values of the residuals, and is therefore
ready for Step 2(a) of the next cycle. By taking
$F(\cdot;\lambda,\gamma)$ to be $\hat\theta_{\mathrm{gLASSO}}(\cdot;\lambda)$, $\hat\theta_{\mathrm{gMCP}}(\cdot;\lambda,\gamma)$ and
$\hat\theta_{\mathrm{gSCAD}}(\cdot;\lambda,\gamma)$ in (2.6) to (2.8), we obtain the solutions
to the group LASSO, group MCP and group
SCAD, respectively. The algorithm has two attractive
features. First, each step is very fast, as it involves
only relatively simple calculations. Second,
the algorithm is stable, as each step is guaranteed
to decrease the objective function (or leave it unchanged).
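
Steps 1 to 4 can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes NumPy, assumes each group has already been orthonormalized so that $X_j'X_j/n = I_{d_j}$, uses the group MCP rule (2.7) with threshold $c_j\lambda$ and $c_j = \sqrt{d_j}$, starts from $\tilde\beta^{(0)} = 0$, and stops when the largest coefficient change in a cycle falls below a tolerance. The function names, the stopping rule and the simulated data are hypothetical.

```python
import numpy as np

def soft(z, t):
    """Multivariate soft-threshold S(z; t) = (1 - t/||z||_2)_+ z."""
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= t else (1.0 - t / norm) * z

def group_mcp_rule(z, lam, gam):
    """Single-group solution (2.7); gam = inf gives the group LASSO rule (2.6)."""
    if np.isinf(gam):
        return soft(z, lam)
    if np.linalg.norm(z) <= gam * lam:
        return gam / (gam - 1.0) * soft(z, lam)
    return z.copy()

def gcd(X_groups, y, lam, gam=np.inf, max_iter=1000, tol=1e-10):
    """Group coordinate descent; assumes X_j'X_j/n = I for every group."""
    n = y.shape[0]
    c = [np.sqrt(Xj.shape[1]) for Xj in X_groups]        # c_j = sqrt(d_j)
    beta = [np.zeros(Xj.shape[1]) for Xj in X_groups]    # initial value 0
    r = y.astype(float)                                   # Step 1: residuals (a copy of y)
    for _ in range(max_iter):
        change = 0.0
        for j, Xj in enumerate(X_groups):                 # Step 2
            z = Xj.T @ r / n + beta[j]                    # (a)
            new = group_mcp_rule(z, c[j] * lam, gam)      # (b)
            r -= Xj @ (new - beta[j])                     # (c) keep residuals current
            change = max(change, float(np.max(np.abs(new - beta[j]))))
            beta[j] = new
        if change < tol:                                  # Steps 3-4
            break
    return beta

# Hypothetical noiseless example with mutually orthogonal groups.
rng = np.random.default_rng(1)
n = 40
Q, _ = np.linalg.qr(rng.standard_normal((n, 5)))
X = np.sqrt(n) * Q                                        # X'X/n = I
X_groups = [X[:, :2], X[:, 2:]]
beta_true = [np.array([2.0, -1.0]), np.zeros(3)]
y = sum(Xj @ bj for Xj, bj in zip(X_groups, beta_true))
fit_mcp = gcd(X_groups, y, lam=0.5, gam=3.0)
fit_lasso = gcd(X_groups, y, lam=0.5)
```

In this orthogonal, noiseless setting the group MCP fit recovers the first group without shrinkage, while the group LASSO fit shrinks it toward zero; both set the second group to zero.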
The above algorithm computes $\hat\beta$ for a given $(\lambda, \gamma)$
pair; to obtain pathwise solutions, we can use the
algorithm repeatedly over a grid of $(\lambda, \gamma)$ values.
For a given value of $\gamma$, we can start at
$\lambda_{\max} = \max_j\{\|n^{-1}X_j'y\|_2/c_j\}$, for which $\hat\beta$ has the solution
0, and proceed along the grid using the value of $\hat\beta$ at
the previous point in the $\lambda$-grid as the initial value
for the current point in the algorithm. An alternative
approach is to use the group LASSO solution
(corresponding to $\gamma = \infty$) as the initial value as we
decrease $\gamma$ for each value of $\lambda$. See Mazumder, Friedman
and Hastie (2011) for a detailed description of
the latter approach in the nongrouped case.
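
The starting point of the $\lambda$-path can be computed directly from the data. The sketch below, assuming NumPy, evaluates $\lambda_{\max}$ and builds a decreasing grid; the log spacing, the grid length and the ratio `min_ratio` are our own illustrative choices, since the article does not prescribe a particular grid. The warm-started loop over the grid is only described in the text and is not implemented here.

```python
import numpy as np

def lambda_max(X_groups, y, c=None):
    """lambda_max = max_j ||X_j'y/n||_2 / c_j, with c_j = sqrt(d_j) by default."""
    n = y.shape[0]
    if c is None:
        c = [np.sqrt(Xj.shape[1]) for Xj in X_groups]
    return max(np.linalg.norm(Xj.T @ y / n) / cj for Xj, cj in zip(X_groups, c))

def lambda_grid(lam_max, n_lambda=50, min_ratio=0.01):
    """Decreasing, log-spaced grid from lam_max down to min_ratio * lam_max."""
    return np.geomspace(lam_max, min_ratio * lam_max, n_lambda)

# Hypothetical data with three groups of sizes 2, 3 and 3.
rng = np.random.default_rng(2)
n = 30
X_groups = [rng.standard_normal((n, d)) for d in (2, 3, 3)]
y = rng.standard_normal(n)
lam_max = lambda_max(X_groups, y)
grid = lambda_grid(lam_max)
```

At `lam_max`, every $\|X_j'y/n\|_2$ is at most $c_j\lambda$, so Step 2(b) started from zero returns zero for every group; moving down the grid, the previous solution would serve as the initial value.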

The results of Tseng (2001) establish that the 
algorithm converges to a minimum. For the group 
LASSO, which has a convex objective function, the 
algorithm therefore converges to the global mini- 
mum. For group SCAD and group MCP, conver- 
gence to a local minimum is possible. See also The- 
orem 4 of Mazumder, Friedman and Hastie (2011) 
for the nongrouped case. 

The availability of the explicit expression in step 
2(b) of the algorithm depends on the choice of $R_j = X_j'X_j/n$
in (2.1) or (2.3). If a different norm is used,
then the groups are not orthonormal, and there are
no explicit solutions to the problem. Without closed-form
solutions, step 2(b) must be solved using numerical
optimization. Algorithms proposed for computing
the group LASSO solutions without using
$R_j = X_j'X_j/n$ include Friedman et al. (2007), Jacob,
Obozinski and Vert (2009) and Liu and Ye (2010).
For generalized linear models, the group coordinate
descent can be applied based on quadratic approximations
to the log-likelihood in the objective function
(Meier, van de Geer and Bühlmann, 2008).

3. BI-LEVEL SELECTION 

The methods described in Section 2 produce es- 
timates that are sparse at the group level and not 
at the level of individual variables. Within a group, 
there are only two possibilities for the selection re- 
sults based on these methods: either all of the vari- 
ables are selected, or none of them are. This is not 
always appropriate for the data. 

For example, consider a genetic association study 
in which the predictors are indicators for the pres- 
ence of genetic variation at different markers. If a 
genetic variant located in a gene is associated with 
the disease, then it is more likely that other variants 
located in the same gene will also be associated with 
the disease; that is, the predictors have a grouping structure.
However, it is not necessarily the case that
all variants within that gene are associated with 
the disease. In such a study, the goal is to iden- 
tify important individual variants, but to increase 
the power of the search by incorporating grouping 
information. 

In this section, we discuss bi-level selection meth- 
ods, which are capable of selecting important groups 
as well as important individual variables within those 
groups. The underlying assumption is that the model 
is sparse at both the group and individual variable 
levels. That is, the nonzero group coefficients $\beta_j$ are
also sparse. It should be noted, however, that less 
work has been done on bi-level selection than on 
group LASSO, and there are still many unanswered 
questions. 

3.1 Concave 1-Norm Group Penalties 

As one might suspect, based on analogy with 
LASSO and ridge regression, it is possible to con- 
struct penalties for bi-level selection by starting with 
the $\ell_1$ norm instead of the $\ell_2$ norm. This substitution
is not trivial, however: a naive application of
the LASSO penalty to the $\ell_1$ norm of a group results
in the original LASSO, which obviously has no
grouping properties.

Applying a concave penalty to the $\ell_1$ norm of
a group, however, does produce an estimator with
grouping properties, as suggested by Huang et al.
(2009), who proposed the group bridge penalty. The
1-norm group bridge applies a bridge penalty to the
$\ell_1$ norm of a group, resulting in the criterion

(3.1)    $$ \frac{1}{2n}\Big\| y - \sum_{j=1}^{J} X_j\beta_j \Big\|_2^2 + \lambda\sum_{j=1}^{J} c_j\|\beta_j\|_1^{\gamma}, $$

where $\lambda > 0$ is the regularization parameter, $\gamma \in (0,1)$
is the bridge index and $\{c_j\}$ are constants
that adjust for the dimension of group $j$. For models
with standardized variables, a reasonable choice
is $c_j = |d_j|^{\gamma}$. When $d_j = 1$, $1 \le j \le J$, (3.1) simplifies
to the standard bridge criterion. The method proposed
by Zhou and Zhu (2010) can be considered a
special case of group bridge with $\gamma = 0.5$. A general
composite absolute penalty based on $\ell_q$ norms was
proposed by Zhao, Rocha and Yu (2009).

Huang et al. (2009) showed that the global group
bridge solution is group selection consistent under
certain regularity conditions. Their results allow $p \to \infty$
as $n \to \infty$ but require $p < n$. In contrast to the
group LASSO, the selection consistency of group
bridge does not require an irrepresentable-type condition.
However, no results are available for the group
bridge in the $J \gg n$ settings.

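In principle, we could apply other concave penalties
to the group $\ell_1$ norm as well, leading to the
more general penalized criterion

(3.2)    $$ \frac{1}{2n}\Big\| y - \sum_{j=1}^{J} X_j\beta_j \Big\|_2^2 + \sum_{j=1}^{J} \rho(\|\beta_j\|_1; c_j\lambda, \gamma). $$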



Choosing p to be the SCAD or MCP penalty in 
(3.2) would seem particularly promising, but to our 
knowledge, these estimators have not been studied. 

3.2 Composite Penalties 

An alternative way of thinking about concave 
1-norm group penalties is that they represent the 
composition of two penalties: a concave group-level 
penalty and an individual variable-level 1-norm 
penalty. It is natural, then, to also consider the com- 
position of concave group-level penalties with other 
individual variable-level penalties. This framework 
was proposed in Breheny and Huang (2009), who 
described grouped penalties as consisting of an outer 
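penalty $\rho_O$ applied to a sum of inner penalties $\rho_I$.
The penalty applied to a group of predictors is therefore
written as

(3.3)    $$ \rho_O\Big(\sum_{k=1}^{d_j} \rho_I(|\beta_{jk}|)\Big), $$

where $\beta_{jk}$ is the $k$th member of the $j$th group, and
the partial derivative with respect to the $jk$th covariate
is

(3.4)    $$ \rho_O'\Big(\sum_{m=1}^{d_j} \rho_I(|\beta_{jm}|)\Big)\,\rho_I'(|\beta_{jk}|). $$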



Note that the group bridge fits into this framework 
with an outer bridge penalty and an inner LASSO 
penalty, as does the group LASSO with an outer 
bridge penalty and an inner ridge penalty. 

From (3.4), we can view group penalization as applying
a rate of penalization to a predictor that consists
of two terms: the first carries information regarding
the group; the second carries information
about the individual predictor. Whether or not a 
variable enters the model is affected both by its indi- 
vidual signal and by the collective signal of the group 
that it belongs to. Thus, a variable with a moderate 
individual signal may be included in a model if it 
belongs to a group containing other members with 



strong signals, or may be excluded if the rest of its 
group displays little association with the outcome. 
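
As a worked instance of the derivative (3.4), consider the 1-norm group bridge, written here (as an assumption made for illustration) with outer penalty $\rho_O(t) = \lambda c_j t^{\gamma}$ and inner penalty $\rho_I(|\beta|) = |\beta|$. For $\beta_{jk} \ne 0$, the rate of penalization on the $jk$th coefficient is then

```latex
\rho_O'\Big(\sum_{m=1}^{d_j} |\beta_{jm}|\Big)\,\rho_I'(|\beta_{jk}|)
  = \lambda c_j \gamma\, \|\beta_j\|_1^{\gamma - 1} \cdot 1,
  \qquad 0 < \gamma < 1 .
```

Because $\gamma - 1 < 0$, the first factor decreases as $\|\beta_j\|_1$ grows, so large coefficients anywhere in the group lower the penalization rate for every member, while the second factor is constant for the LASSO inner penalty.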

An interesting special case of the composite pen- 
alty is using the MCP as both the outer and in- 
ner penalties, which we refer to as the composite 
MCP (this penalty was referred to as "group MCP" 
in Breheny and Huang (2009); we use "composite 
MCP" both to better reflect the framework and avoid 
confusion with the 2-norm group MCP of Section 2.2). 

The composite MCP uses the criterion

(3.5)    $\frac{1}{2n}\Bigl\|y - \sum_{j=1}^{J} X_j\beta_j\Bigr\|_2^2 + \sum_{j=1}^{J} \rho_{\lambda,\gamma_O}\Bigl(\sum_{k=1}^{d_j} \rho_{\lambda,\gamma_I}(|\beta_{jk}|)\Bigr),$

where $\rho$ is the MCP penalty and $\gamma_O$, the tuning
parameter of the outer penalty, is chosen to be $d_j\gamma_I\lambda/2$
in order to ensure that the group-level penalty attains
its maximum if and only if each of its components is at
its maximum. In other words, the derivative of the outer
penalty reaches 0 if and only if
$|\beta_{jk}| \ge \gamma_I\lambda$ for all $k \in \{1,\ldots,d_j\}$.
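
The role of the choice $\gamma_O = d_j\gamma_I\lambda/2$ can be checked numerically. The sketch below is a minimal illustration, assuming the usual closed form of the MCP ($\lambda t - t^2/(2\gamma)$ for $t \le \gamma\lambda$, and $\gamma\lambda^2/2$ otherwise); the function names are hypothetical and are not those of any package.

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP penalty: lam*t - t^2/(2*gamma) up to gamma*lam, then constant."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_deriv(t, lam, gamma):
    """MCP derivative: lam * (1 - t/(gamma*lam))_+ = (lam - t/gamma)_+."""
    return np.maximum(lam - np.abs(t) / gamma, 0.0)

def composite_mcp(beta_j, lam, gamma_i):
    """Group term of (3.5), with gamma_O = d_j * gamma_I * lam / 2."""
    gamma_o = len(beta_j) * gamma_i * lam / 2
    inner_sum = np.sum(mcp(beta_j, lam, gamma_i))
    return float(mcp(inner_sum, lam, gamma_o))

def penalization_rate(beta_j, lam, gamma_i):
    """Rate (3.4): outer derivative at the inner sum times inner derivative."""
    gamma_o = len(beta_j) * gamma_i * lam / 2
    inner_sum = np.sum(mcp(beta_j, lam, gamma_i))
    return mcp_deriv(inner_sum, lam, gamma_o) * mcp_deriv(beta_j, lam, gamma_i)

lam, gamma_i = 1.0, 3.0
# Every |beta_jk| >= gamma_I * lam: the outer derivative vanishes.
rate_all_large = penalization_rate(np.array([4.0, 5.0]), lam, gamma_i)
# One small coefficient: the outer derivative is positive again,
# so the small coefficient is still penalized.
rate_one_small = penalization_rate(np.array([4.0, 1.0]), lam, gamma_i)
```

In the second call the large coefficient has rate zero (its inner derivative vanishes), while the small one keeps a positive rate because the group has not yet reached its cap.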

Figure 2 shows the group LASSO, 2-norm group 
MCP, 1-norm group bridge and composite MCP
penalties for a two-predictor group. Note that where 
the penalty comes to a point or edge, there is the 
possibility that the solution will take on a sparse 
value; all penalties come to a point at 0, encour- 
aging group-level sparsity, but only group bridge 
and composite MCP allow for bi-level selection. In 
addition, one can see that the MCP penalties are 
capped, while the group LASSO and group bridge 
penalties are not. Furthermore, note that the indi- 
vidual variable-level penalty for the composite MCP 
is capped at a level below that of the group; this lim- 
its the extent to which one variable can dominate 
the penalty of the entire group. The 2-norm group 
MCP does not have this property. This illustrates 
the two rationales of composite MCP: (1) to avoid 
overshrinkage by allowing covariates to grow large, 
and (2) to allow groups to remain sparse internally. 
The 1-norm group bridge allows the presence of a 
single large predictor to continually lower the entry 
threshold of the other variables in its group. This 
property, whereby a single strong predictor draws 
others into the model, prevents the group bridge 
from achieving consistency for the selection of in- 
dividual variables. 

Fig. 2. The group LASSO, group bridge and composite MCP penalties for a two-predictor group. Note that where the penalty
comes to a point or edge, there is the possibility that the solution will take on a sparse value; all penalties come to a point at
0, encouraging group-level sparsity, but only group bridge and composite MCP allow for bi-level selection.
[Panels: Group LASSO; 2-norm group MCP; Group bridge; Composite MCP.]



Figure 3 shows the coefficient paths from $\lambda_{\max}$
down to 0 for group LASSO, 1-norm group bridge,
and composite MCP for a simulated data set featuring
two groups, each with three covariates. In the
underlying model, the group represented by solid
lines has two covariates with coefficients equal to 1
and the other equal to 0; the group represented by
dashed lines has two coefficients equal to 0 and the
other equal to -1. The figure reveals much about the
behavior of grouped penalties. In particular, we note
the following: (1) Even though each of the nonzero
coefficients is of the same magnitude, the coefficients
from the more significant solid group enter the model
more easily than the lone nonzero coefficient from
the dashed group. (2) This phenomenon is less pronounced
for composite MCP, which makes weaker
assumptions about grouping. (3) For composite MCP
at $\lambda \approx 0.3$, all of the variables with true zero
coefficients have been eliminated while the remaining
coefficients are unpenalized. In this region, the composite
MCP approach is performing as well as the
oracle model. (4) In general, the coefficient paths
for these group penalization methods are continuous,
but are not piecewise linear, unlike those for
the LASSO.

Although composite penalties do not, in general, 
have closed-form solutions in single-group models 
like the penalties in Section 2, the idea of group co- 
ordinate descent can still be used. The main compli- 
cation is in step 2(b) for the algorithm described in 
Section 2.4, where the single-group solutions need to 
be solved numerically. Another approach is based on 
a local coordinate descent algorithm (Breheny and 
Huang, 2009). This algorithm first uses a local lin- 
ear approximation to the penalty function (Zou and 
Li, 2008). After applying this approximation, in any 
given coordinate direction the optimization prob- 
lem is equivalent to the one-dimensional LASSO, 
which has the soft-threshold operator as its solution.
The thresholding parameter $\lambda$ in each update
is given by expression (3.4). Because the penalties
involved are concave on $[0,\infty)$, the linear approximation
is a majorizing function, and the algorithm
thus enjoys the descent property of MM algorithms
(Lange, Hunter and Yang (2000)) whereby the objective
function is guaranteed to decrease at every
iteration. Further details may be found in Breheny
and Huang (2009). These algorithms have been implemented
in the R package grpreg, available at
http://cran.r-project.org. The package computes
the group LASSO, group bridge and composite
MCP solutions for linear regression and logistic
regression models.

Fig. 3. Coefficient paths from 0 to $\lambda_{\max}$ for group LASSO, 2-norm group MCP, 1-norm group bridge, and composite MCP
for a simulated data set featuring two groups, each with three covariates. In the underlying data-generating mechanism, the
group represented by solid lines has two covariates with coefficients equal to 1 and the other equal to 0; the group represented
by dashed lines has two coefficients equal to 0 and the other equal to -1.
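
The local coordinate descent idea can be sketched in a few lines of Python. This is a minimal illustration for the composite MCP in linear regression, not the grpreg implementation: it assumes standardized columns (so that $x_{jk}'x_{jk}/n = 1$), the usual closed form of the MCP, and hypothetical function names. Each coordinate update is a soft-threshold step whose threshold is the rate (3.4) evaluated at the current iterate.

```python
import numpy as np

def mcp(t, lam, gamma):
    t = np.abs(t)
    return np.where(t <= gamma * lam, lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_deriv(t, lam, gamma):
    return np.maximum(lam - np.abs(t) / gamma, 0.0)

def soft_threshold(z, thresh):
    """Solution of the one-dimensional LASSO problem."""
    return np.sign(z) * max(abs(z) - thresh, 0.0)

def objective(y, X, beta, groups, lam, gamma_i):
    """Composite MCP criterion (3.5)."""
    loss = np.sum((y - X @ beta) ** 2) / (2 * len(y))
    pen = 0.0
    for idx in groups:
        gamma_o = len(idx) * gamma_i * lam / 2
        pen += float(mcp(np.sum(mcp(beta[idx], lam, gamma_i)), lam, gamma_o))
    return loss + pen

def lcd_sweep(y, X, beta, groups, lam, gamma_i):
    """One pass over all coordinates; beta is updated in place."""
    n = len(y)
    r = y - X @ beta                      # current residuals
    for idx in groups:
        gamma_o = len(idx) * gamma_i * lam / 2
        for k in idx:
            inner_sum = np.sum(mcp(beta[idx], lam, gamma_i))
            # Local linear approximation: threshold given by (3.4).
            thresh = (mcp_deriv(inner_sum, lam, gamma_o)
                      * mcp_deriv(beta[k], lam, gamma_i))
            z = X[:, k] @ r / n + beta[k]
            new = soft_threshold(z, thresh)
            r -= X[:, k] * (new - beta[k])
            beta[k] = new
    return beta

# Simulated data loosely mirroring the two-group example of Figure 3.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # columns satisfy x'x/n = 1
y = X @ np.array([1.0, 1.0, 0.0, 0.0, 0.0, -1.0]) + rng.standard_normal(n)
y = y - y.mean()
groups = [[0, 1, 2], [3, 4, 5]]

beta = np.zeros(6)
history = [objective(y, X, beta, groups, 0.3, 3.0)]
for _ in range(50):
    beta = lcd_sweep(y, X, beta, groups, 0.3, 3.0)
    history.append(objective(y, X, beta, groups, 0.3, 3.0))
```

Recording `history` makes the descent property visible: because each update minimizes a majorizing surrogate, the sequence of objective values should never increase, up to floating-point error.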

3.3 Additive Penalties 

Another approach to achieving bi-level selection
is to add an $\ell_1$ penalty to the group LASSO (Wu
and Lange, 2008; Friedman, Hastie and Tibshirani,
2010):

(3.6)    $\frac{1}{2n}\Bigl\|y - \sum_{j=1}^{J} X_j\beta_j\Bigr\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\sum_{j=1}^{J}\|\beta_j\|_2,$

where $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$ are regularization
parameters. The above objective function has the benefit of
being convex, eliminating the possibility of convergence
to a local minimum during model fitting. The
group coordinate descent algorithm can no longer
be applied, however, as the orthonormalization procedure
described in Section 2 will not preserve the
sparsity achieved by the $\ell_1$ penalty once the solution
is transformed back to the original variables. Puig,
Wiesel and Hero (2011), Friedman, Hastie and Tib- 
shirani (2010) and Zhou et al. (2010) have proposed 
algorithms for solving this problem without requir- 
ing orthonormalization. 
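
One generic way to minimize (3.6) without orthonormalization is proximal gradient descent. The sketch below is offered as an illustration under that assumption; it is not the algorithm of any of the papers cited above, and the function names are hypothetical. It relies on the known fact that the proximal map of $t_1\|\cdot\|_1 + t_2\|\cdot\|_2$ on a single group is elementwise soft-thresholding followed by shrinkage of the group norm.

```python
import numpy as np

def prox_sparse_group(v, t1, t2):
    """Prox of t1*||.||_1 + t2*||.||_2 for one group of coefficients."""
    u = np.sign(v) * np.maximum(np.abs(v) - t1, 0.0)  # l1 step
    norm = np.linalg.norm(u)
    if norm <= t2:                                    # whole group set to zero
        return np.zeros_like(u)
    return (1.0 - t2 / norm) * u                      # group-level shrinkage

def sgl_objective(y, X, beta, groups, lam1, lam2):
    """Criterion (3.6)."""
    loss = np.sum((y - X @ beta) ** 2) / (2 * len(y))
    group_norms = sum(np.linalg.norm(beta[g]) for g in groups)
    return loss + lam1 * np.sum(np.abs(beta)) + lam2 * group_norms

def sgl_proximal_gradient(y, X, groups, lam1, lam2, n_iter=200):
    """Proximal gradient with fixed step 1/L, L = largest eigenvalue of X'X/n."""
    n, p = X.shape
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    beta = np.zeros(p)
    history = [sgl_objective(y, X, beta, groups, lam1, lam2)]
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        v = beta - step * grad
        new_beta = np.zeros(p)
        for g in groups:
            new_beta[g] = prox_sparse_group(v[g], step * lam1, step * lam2)
        beta = new_beta
        history.append(sgl_objective(y, X, beta, groups, lam1, lam2))
    return beta, history
```

Because the problem is convex and the step size is 1/L, the objective values in `history` are non-increasing, and both kinds of zeros (whole groups and individual coefficients within a selected group) are produced directly by the proximal step.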

In principle, the group LASSO portion of the penalty
could be replaced with any of the concave 2-norm
group penalties of Section 2.2; likewise the $\ell_1$
penalty could be replaced by, say, MCP or SCAD.
These possibilities, to the best of our knowledge, 
have not been explored. Further work is needed to 
study the properties of this class of estimators and 
compare their performance with other methods. 

3.4 Example: Genetic Association 

We now give an example from a genetic associa- 
tion study where bi-level selection is an important 
goal of the study. The example involves data from 
a case-control study of age-related macular degener- 
ation consisting of 400 cases and 400 controls, and 
was analyzed in Breheny and Huang (2009). The 
analysis is confined to 30 genes containing 532 markers
that previous biological studies have suggested
may be related to the disease.

Table 1
Application of the three group penalization methods and
a one-at-a-time method to a genetic association data set.
CV error is the average number of misclassification
errors over the ten validation folds

                 Genes      Markers    Cross-validation
                 selected   selected   error
One-at-a-time    19          49        0.441
Group LASSO      17         435        0.390
Group bridge      3          20        0.400
Composite MCP     8          11        0.391

We analyze the data with the group LASSO, 
1-norm group bridge and composite MCP methods 
by considering markers to be grouped by the gene 
they belong to. Penalized logistic regression models 
were fit assuming an additive effect for all markers 
(homozygous dominant = 2, heterozygous = 1, ho- 
mozygous recessive = 0). In addition to the group 
penalization methods, we analyzed these data using 
a traditional one-at-a-time approach (single-marker 
analysis), in which univariate logistic regression mod- 
els were fit and marker effects screened using a p < 
0.05 cutoff. Ten-fold cross-validation was used to se- 
lect A, and to assess accuracy (for the one-at-a-time 
approach, predictions were made from an unpenal- 
ized logistic regression model fit to the training data 
using all the markers selected by individual testing) . 
The results are presented in Table 1. 
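
For concreteness, the additive coding and the gene-level grouping described above might be set up as follows. This is only a sketch: the genotype strings, the allele label `"A"` and the gene names are made-up placeholders, not data from the study.

```python
import numpy as np

def additive_coding(genotypes, dominant="A"):
    """Count copies of the dominant allele: 'AA' -> 2, 'Aa' -> 1, 'aa' -> 0."""
    return np.array([[g.count(dominant) for g in row] for row in genotypes])

def group_by_gene(marker_gene):
    """Map each gene to the column indices of its markers."""
    groups = {}
    for col, gene in enumerate(marker_gene):
        groups.setdefault(gene, []).append(col)
    return groups

# Two subjects, three markers; the first two markers lie in the same gene.
genotypes = [["AA", "Aa", "aa"],
             ["aa", "aA", "AA"]]
X = additive_coding(genotypes)
groups = group_by_gene(["GENE1", "GENE1", "GENE2"])
```

The resulting matrix `X` and index lists in `groups` are the inputs a group penalization routine would need.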

Table 1 suggests the benefits of using group pe- 
nalization regression approaches as opposed to one- 
at-a-time approaches: the three group penalization 
methods achieve lower test error rates and do so 
while selecting fewer genes (groups). Although the 
error rates of ~40% indicate that these 30 genes 
likely do not include SNPs that exert a large ef- 
fect on an individual's chances of developing age- 
related macular degeneration, the fact that they are 
well below the 50% that would be expected by random
chance demonstrates that these genes do contain
SNPs related to the disease. The very different
selection properties of the three group penalization
methods are also clearly seen. Although group LASSO
achieves low misclassification error, it selects 17 genes
out of 30 and 435 markers out of 532, failing to shed
light on the most important genetic markers. The
bi-level selection methods achieve comparable error
rates with a much sparser set of predictors: group
bridge identifies 3 promising genes out of 30 candidates,
and composite MCP identifies 11 promising SNPs out of 532.

4. ORACLE PROPERTY OF THE 2-NORM 
GROUP MCP 

In this section, we look at the selection properties
of the 2-norm group MCP estimator $\hat\beta(\lambda,\gamma)$,
defined as the global minimizer of (2.3) with $c_j = \sqrt{d_j}$,
when $\rho$ is taken to be the MCP penalty. We provide
sufficient conditions under which the 2-norm group 
MCP estimator is equal to the oracle least squares 
estimator defined at (4.1) below. Our intention is 
to give some preliminary theoretical justification for 
this concave group selection method under reason- 
able conditions, not necessarily to obtain the best 
possible theoretical results or to provide a system- 
atic treatment of the properties of the concave group 
selection methods discussed in this review. 

Let $X = (X_1,\ldots,X_J)$ and $\Sigma = X'X/n$. For any
$A \subseteq \{1,\ldots,J\}$, denote

$X_A = (X_j, j \in A), \qquad \Sigma_A = X_A'X_A/n.$

Let the true value of the regression coefficients be
$\beta^0 = (\beta_1^{0\prime},\ldots,\beta_J^{0\prime})'$.
Let $S = \{j : \|\beta_j^0\|_2 \ne 0, 1 \le j \le J\}$,
which is the set of indices of the groups with
nonzero coefficients in the underlying model. Let
$\beta_*^0 = \min\{\|\beta_j^0\|_2/\sqrt{d_j} : j \in S\}$
and set $\beta_*^0 = \infty$ if $S$ is empty. Define

(4.1)    $\hat\beta^o = \arg\min_b \{\|y - Xb\|_2^2 : b_j = 0 \ \forall j \notin S\}.$

This is the oracle least squares estimator. Of course,
it is not a real estimator, since the oracle set is unknown.

Let $d_{\max} = \max\{d_j : 1 \le j \le J\}$ and
$d_{\min} = \min\{d_j : 1 \le j \le J\}$. For any
$A \subseteq \{1,\ldots,J\}$, denote
$d_{\min}(A) = \min\{d_j : j \in A\}$ and
$d_{\max}(A) = \max\{d_j : j \in A\}$.
Here $d_{\min}(A) = \infty$ if $A$ is empty. Let $c_{\min}$ be
the smallest eigenvalue of $\Sigma$, and let $c_1$ and $c_2$ be the
smallest and largest eigenvalues of $\Sigma_S$, respectively.

We first consider the case where the 2-norm group 
MCP objective function is convex. This necessarily 
requires $c_{\min} > 0$. Define the function

(4.2)    $h(t,k) = \exp\bigl(-k(\sqrt{2t-1}-1)^2/4\bigr), \qquad t > 1,\ k = 1,2,\ldots.$

This function arises from an upper bound for the
tail probabilities of chi-square distributions given in
Lemma A.1 in the Appendix, which is based on an
exponential inequality for chi-square random variables
of Laurent and Massart (2000). Let

(4.3)    $\eta_{1n}(\lambda) = (J - |S|)\,h(\lambda^2 n/\sigma^2, d_{\min}(S^c))$

and

(4.4)    $\eta_{2n}(\lambda) = |S|\,h(c_1 n(\beta_*^0 - \gamma\lambda)^2/\sigma^2, d_{\min}(S)).$

Theorem 4.1. Suppose $\varepsilon_1,\ldots,\varepsilon_n$ are independent
and identically distributed as $N(0,\sigma^2)$. Then
for any $(\lambda,\gamma)$ satisfying $\gamma > 1/c_{\min}$,
$\beta_*^0 > \gamma\lambda$ and $n\lambda^2 > \sigma^2$, we have

$P(\hat\beta(\lambda,\gamma) \ne \hat\beta^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda).$

The proof of this theorem is given in the Appendix.
It provides an upper bound on the probability that
$\hat\beta(\lambda,\gamma)$ is not equal to the oracle least squares
estimator. The condition $\gamma > 1/c_{\min}$ ensures that the
2-norm group MCP criterion is strictly convex. This
implies $\hat\beta(\lambda,\gamma)$ is uniquely characterized by the
Karush-Kuhn-Tucker conditions. The condition $n\lambda^2 > \sigma^2$
requires that $\lambda$ cannot be too small.
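
To get a feel for the size of the bound in Theorem 4.1, the quantities in (4.2) to (4.4) can be evaluated directly. The sketch below is a plain transcription of those formulas under assumed inputs; the function and argument names are hypothetical, the convention of returning 1 when $t \le 1$ is an assumption made here so that the bound stays trivially valid, and the condition $\gamma > 1/c_{\min}$ is left for the caller to verify.

```python
import math

def h(t, k):
    """h(t, k) of (4.2); for t <= 1 fall back to the trivial bound 1."""
    if t <= 1:
        return 1.0
    return math.exp(-k * (math.sqrt(2 * t - 1) - 1) ** 2 / 4)

def theorem_4_1_bound(lam, gamma, n, sigma2, J, size_S,
                      dmin_S, dmin_Sc, c1, beta_star):
    """Return (eta_1n, eta_2n, capped sum) from (4.3) and (4.4)."""
    if beta_star <= gamma * lam:
        raise ValueError("requires beta_star > gamma * lam")
    if n * lam ** 2 <= sigma2:
        raise ValueError("requires n * lam^2 > sigma^2")
    eta1 = (J - size_S) * h(n * lam ** 2 / sigma2, dmin_Sc)
    eta2 = size_S * h(c1 * n * (beta_star - gamma * lam) ** 2 / sigma2, dmin_S)
    return eta1, eta2, min(1.0, eta1 + eta2)

# Illustrative inputs: 10 groups of size 2, two of them nonzero.
eta1, eta2, bound = theorem_4_1_bound(
    lam=0.5, gamma=2.0, n=100, sigma2=1.0, J=10, size_S=2,
    dmin_S=2, dmin_Sc=2, c1=1.0, beta_star=2.0,
)
```

With these inputs both terms are far below one, which illustrates how quickly the bound decays once $n\lambda^2/\sigma^2$ and $c_1 n(\beta_*^0 - \gamma\lambda)^2/\sigma^2$ are moderately large.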

Let

(4.5)    $\lambda_n = \sigma\bigl(2\log(\max\{J-|S|,1\})/(n\,d_{\min}(S^c))\bigr)^{1/2}$ and $\tau_n = \sigma\sqrt{2\log(\max\{|S|,1\})/(n c_1 d_{\min}(S))}.$

The following corollary is an immediate consequence
of Theorem 4.1.

Corollary 4.1. Suppose that the conditions of
Theorem 4.1 are satisfied. Also suppose that
$\beta_*^0 > \gamma\lambda + a_n\tau_n$ for $a_n \to \infty$ as $n \to \infty$.
If $\lambda > a_n\lambda_n$, then

$P(\hat\beta(\lambda,\gamma) \ne \hat\beta^o) \to 0$ as $n \to \infty$.

By Corollary 4.1, the 2-norm group MCP estima- 
tor behaves like the oracle least squares estimator 
with high probability. This of course implies it is 
group selection consistent. For the standard LASSO 
estimator, a sufficient condition for its sign consis- 
tency is the strong irrepresentable condition (Zhao 
and Yu, 2006). Here a similar condition holds auto- 
matically due to the form of the MCP. Specifically, 
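let $\beta_S^0 = (\beta_j^{0\prime} : j \in S)'$. Then an extension of the
irrepresentable condition to the present setting is, for
some $0 < \delta < 1$,

(4.6)    $\max_{j \notin S}\bigl\|X_j'X_S(X_S'X_S)^{-1}\dot\rho(\beta_S^0;\lambda,\gamma)/\lambda\bigr\|_2 \le 1 - \delta,$

where $\dot\rho(\beta_S^0;\lambda,\gamma) = (\dot\rho(\|\beta_j^0\|_2;\sqrt{d_j}\lambda,\gamma)\,\beta_j^{0\prime}/\|\beta_j^0\|_2 : j \in S)'$ with

$\dot\rho(\|\beta_j^0\|_2;\sqrt{d_j}\lambda,\gamma) = \lambda\bigl(1 - \|\beta_j^0\|_2/(\sqrt{d_j}\gamma\lambda)\bigr)_+.$

Since it is assumed that $\min_{j \in S}\|\beta_j^0\|_2/\sqrt{d_j} > \gamma\lambda$, we
have $\dot\rho(\|\beta_j^0\|_2;\sqrt{d_j}\lambda,\gamma) = 0$ for all $j \in S$. Therefore,
(4.6) always holds.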

We now consider the high-dimensional case where 
J > n. We require the sparse Riesz condition, or 
SRC (Zhang and Huang, 2008), which is a form of 
sparse eigenvalue condition. We say that X satisfies 
the SRC with rank $d^*$ and spectrum bounds $\{c_*, c^*\}$
if

(4.7)    $0 < c_* \le \|X_A u\|_2^2/n \le c^* < \infty \qquad \forall A$ with $|A| \le d^*$, $\|u\|_2 = 1.$

We refer to this condition as SRC$(d^*, c_*, c^*)$.

Let $K_* = (c^*/c_*) - (1/2)$, $m_* = K_*|S|$ and
$\xi = 1/(4c^* d_S)$, where $d_S = \max\{d_{\max}(S), 1\}$. Define

V3n(\) = (J-\S\r*-^ 
111* 

(4-8) 

max j <^max) ■ 

Let $\eta_{1n}$ and $\eta_{2n}$ be as in (4.3) and (4.4).

Theorem 4.2. Suppose $\varepsilon_1,\ldots,\varepsilon_n$ are independent
and identically distributed as $N(0,\sigma^2)$, and $X$
satisfies the SRC$(d^*, c_*, c^*)$ in (4.7) with
$d^* > (K_* + 1)|S|d_S$. Then for any $(\lambda,\gamma)$ satisfying
$\beta_*^0 > \gamma\lambda$, $n\lambda^2\xi > \sigma^2 d_{\max}$ and
$\gamma > c_*^{-1}\sqrt{4 + (c_*/c^*)}$, we have

$P(\hat\beta(\lambda,\gamma) \ne \hat\beta^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda) + \eta_{3n}(\lambda).$

Letting

$\lambda_n^* = 2\sigma\sqrt{2c^* d_S\log(J - |S|)/n}$

and $\tau_n$ be as in (4.5), Theorem 4.2 has the following
corollary.

Corollary 4.2. Suppose the conditions of Theorem 4.2
are satisfied. Also suppose $\beta_*^0 > \gamma\lambda + a_n\tau_n$
for $a_n \to \infty$ as $n \to \infty$. Then if $\lambda > a_n\lambda_n^*$,

$P(\hat\beta(\lambda,\gamma) \ne \hat\beta^o) \to 0$ as $n \to \infty$.

Theorem 4.2 and Corollary 4.2 provide sufficient
conditions for the selection consistency of the global
2-norm group MCP estimator in the $J \gg n$ situations.
For example, we can have
$J - |S| = \exp\{o(n/(c^* d_S))\}$. The condition
$n\lambda^2\xi > \sigma^2 d_{\max}$ is stronger
than the corresponding condition $n\lambda^2 > \sigma^2$ in
Theorem 4.1. The condition $\gamma > c_*^{-1}\sqrt{4 + (c_*/c^*)}$ ensures
that the group MCP criterion is convex in any
$d^*$-dimensional subspace. It is stronger than the minimal
sufficient condition $\gamma > 1/c_*$ for convexity in
$d^*$-dimensional subspaces. These reflect the difficulty
and extra efforts needed in reducing a p-dimensional
problem to a $d^*$-dimensional problem. The SRC in
(4.7) guarantees that the model is identifiable in a
lower $d^*$-dimensional space.

The results presented above are concerned with 
the global solutions. The properties of the local so- 
lutions, such as those produced by the group co- 
ordinate descent algorithm, to concave 2-norm or 
1-norm penalties remain largely unknown in models 
with $J \gg n$. An interesting question is under what
conditions the local solutions are equal to or suf- 
ficiently close to the global solutions so that they 
are still selection consistent. In addition, the esti- 
mation and prediction properties of these solutions 
have not been studied. We expect that the methods 
of Zhang and Zhang (2011) in studying the proper- 
ties of concave regularization will be helpful in group 
and bi-level selection problems.

5. APPLICATIONS 

We now give a review of some applications of the 
group selection methods in several statistical mod- 
eling and analysis problems, including nonparamet- 
ric additive models, semiparametric partially linear 
models, seemingly unrelated regressions and multi- 
task learning and genetic and genomic data analysis. 

5.1 Nonparametric Additive Models 

Let $(y_i, x_i)$, $i = 1,\ldots,n$, be random vectors that are
independently and identically distributed as $(y, x)$,
where $y$ is a response variable, and $x = (x_1,\ldots,x_p)'$
is a p-dimensional covariate vector. The nonparametric
additive model (Hastie and Tibshirani, 1990)
posits that

(5.1)    $y_i = \mu + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i, \qquad 1 \le i \le n,$

where $\mu$ is an intercept term, $x_{ij}$ is the jth component
of $x_i$, the $f_j$'s are unknown functions and $\varepsilon_i$
is an unobserved random variable with mean zero
and finite variance $\sigma^2$. Suppose that some of the
additive components $f_j$ are zero. The problem is to
select the nonzero components and estimate them.
Lin and Zhang (2006) proposed the component se- 
lection and smoothing operator (COSSO) method 
that can be used for selection and estimation in 
(5.1). The COSSO can be viewed as a group LASSO
procedure in a reproducing kernel Hilbert space. For 
fixed p, they studied the rate of convergence of the 
COSSO estimator in the additive model. They also 
showed that, in the special case of a tensor product 
design, the COSSO correctly selects the non-zero 
additive components with high probability. Zhang 
and Lin (2006) considered the COSSO for nonpara- 
metric regression in exponential families. Meier, van 
de Geer and Bühlmann (2009) proposed a variable
selection method in (5.1) with $p \gg n$ that is closely
related to the group LASSO. They give conditions 
under which, with high probability, their procedure 
selects a set of the nonparametric components whose 
distance from zero in a certain metric exceeds a 
specified threshold under a compatibility condition. 
Ravikumar et al. (2009) proposed a penalized ap- 
proach for variable selection in (5.1). In their theo- 
retical results on selection consistency, they assume 
that the eigenvalues of a "design matrix" are bounded
away from zero and infinity, where the "design ma- 
trix" is formed from the basis functions for the non- 
zero components. Another critical condition required 
in their paper is similar to the irrepresentable con- 
dition of Zhao and Yu (2006). Huang, Horowitz and 
Wei (2010) studied the group LASSO and adaptive 
group LASSO for variable selection in (5.1) based on 
a spline approximation to the nonparametric com- 
ponents. With this approximation, each nonpara- 
metric component is represented by a linear com- 
bination of spline basis functions. Consequently, the 
problem of component selection becomes that of se- 
lecting the groups of coefficients in the linear com- 
binations. They provided conditions under which 
the group LASSO selects a model whose number 
of components is comparable with the underlying 
model, and the adaptive group LASSO selects the 
nonzero components correctly with high probability 
and achieves the optimal rate of convergence. 
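
The reduction from component selection to group selection can be made concrete with a small sketch. The code below expands each covariate in a spline-type basis and records which columns belong to which covariate. It uses a truncated power basis as a simple stand-in for whatever spline basis one prefers (the cited work may use a different basis), and the function names are hypothetical.

```python
import numpy as np

def truncated_power_basis(x, n_knots=3, degree=3):
    """Centered basis: x, ..., x^degree and (x - knot)_+^degree terms."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.clip(x - k, 0, None) ** degree for k in knots]
    B = np.column_stack(cols)
    return B - B.mean(axis=0)        # centering absorbs the intercept

def additive_design(X, n_knots=3, degree=3):
    """Stack the per-covariate bases; group[c] is the covariate of column c."""
    blocks = [truncated_power_basis(X[:, j], n_knots, degree)
              for j in range(X.shape[1])]
    group = np.repeat(np.arange(X.shape[1]), degree + n_knots)
    return np.hstack(blocks), group

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 4))
Z, group = additive_design(X)        # 4 covariates, 6 basis columns each
```

Setting an entire group of coefficients to zero then corresponds to dropping the component $f_j$, so any of the group selection methods reviewed here can be applied to `Z` with the labels in `group`.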

5.2 Structure Estimation in Semiparametric 
Regression Models 

Consider the semiparametric partially linear model 
(Engle et al., 1986) 

(5.2)    $y_i = \mu + \sum_{j \in S_1}\beta_j x_{ij} + \sum_{j \in S_2} f_j(x_{ij}) + \varepsilon_i, \qquad 1 \le i \le n,$

where $S_1$ and $S_2$ are mutually exclusive and complementary
subsets of $\{1,\ldots,p\}$, $\{\beta_j : j \in S_1\}$ are
regression coefficients of the covariates with indices
in $S_1$ and $(f_j : j \in S_2)$ are unknown functions. The
most important assumption in the existing methods
for the estimation in partially linear models is that
$S_1$ and $S_2$ are known a priori. This assumption
underlies the construction of the estimators and
investigation of their theoretical properties in the existing
methods (Härdle, Liang and Gao, 2000; Bickel et al.,
1993). However, in applied work, it is rarely known 
in advance which covariates have linear effects and 
which have nonlinear effects. Recently, Zhang, Cheng 
and Liu (2011) proposed a method for determining 
the zero, linear and nonlinear components in par- 
tially linear models. Their method is a regularization 
method in the smoothing spline ANOVA framework
that is closely related to the COSSO. They obtained 
the rate of convergence of their proposed estima- 
tor. They also showed that their method is selec- 
tion consistent in the special case of tensor product 
design. But their approach requires tuning of four 
penalty parameters, which may be difficult to imple- 
ment in practice. Huang, Wei and Ma (2011) pro- 
posed a semiparametric regression pursuit method 
for estimating $S_1$ and $S_2$. They embedded partially
linear models into model (5.1). By approximating 
the nonparametric components using spline series 
expansions, they transformed the problem of model 
specification into a group variable selection prob- 
lem. They then used the 2-norm group MCP to de- 
termine the linear and nonlinear components. They 
showed that, under suitable conditions, the proposed 
approach is consistent in estimating the structure of 
(5.2), meaning that it can correctly determine which 
covariates have a linear effect and which do not with 
high probability. 

5.3 Varying Coefficient Models 

Consider the linear varying coefficient model 

(5.3)    $y_i(t_{ij}) = \sum_{k=1}^{p} x_{ik}(t_{ij})\beta_k(t_{ij}) + \varepsilon_i(t_{ij}), \qquad i = 1,\ldots,n,\ j = 1,\ldots,n_i,$

where $y_i(t)$ is the response variable for the ith subject
at time point $t \in T$ with $T$ being the time interval
on which the measurements are taken, $\varepsilon_i(t)$ is
the error term, the $x_{ik}(t)$'s are time-varying covariates
and $\beta_k(t)$ is the corresponding smooth coefficient
function. Such a model is useful in investigating the
time-dependent effects of covariates on responses measured
repeatedly. One well-known example is longitudinal
data analysis (Hoover et al., 1998), where
the response for the ith experimental subject in the
study is observed on $n_i$ occasions and the set of
observations at times $\{t_{ij} : j = 1,\ldots,n_i\}$ tends to be
correlated. Another important example is the functional
response models (Rice, 2004), where the response
$y_i(t)$ is a smooth real function, although only $y_i(t_{ij})$,
$j = 1,\ldots,n_i$, are observed in practice. Wang, Chen
and Li (2007) and Wang and Xia (2009) considered 
the use of group LASSO and SCAD methods for 
model selection and estimation in (5.3). Xue, Qu 
and Zhu (2010) applied the 2-norm SCAD method 
for variable selection in generalized linear varying-coefficient
models and considered its selection and
estimation properties. These authors obtained their 
results in the models with fixed dimensions. Wei, 
Huang and Li (2011) studied the properties of the 
group LASSO and adaptive group LASSO for (5.3) 
in the $p \gg n$ settings. They showed that, under the
sparse Riesz condition and other regularity condi- 
tions, the group LASSO selects a model of the right 
order of dimensionality, selects all variables with
coefficient functions whose $L_2$ norm is greater than a
certain threshold level and is estimation consistent. 
They also proved that the adaptive group LASSO 
can correctly select important variables with high 
probability based on an initial consistent estimator. 

5.4 Seemingly Unrelated Regressions and 
Multi-Task Learning 

Consider T linear regression models 

$y_t = X_t\beta_t + \varepsilon_t, \qquad t = 1,\ldots,T,$

where $y_t$ is an $n \times 1$ response vector, $X_t$ is an $n \times p$
design matrix, $\beta_t$ is a $p \times 1$ vector of regression
coefficients and $\varepsilon_t$ is an $n \times 1$ error vector. Assume
that $\varepsilon_1,\ldots,\varepsilon_T$ are independent and identically
distributed with mean zero and covariance matrix $\Sigma$.
This model is referred to as the seemingly unrelated 
regressions (SUR) model (Zellner, 1962). Although 
each model can be estimated separately based on 
least squares method, it is possible to improve on the 
estimation efficiency of this approach. Zellner (1962) 
proposed a method for estimating all the coefficients 
simultaneously that is more efficient than the single- 
equation least squares estimators. This model is also 
called a multi-task learning model in machine learn- 
ing (Caruana, 1997; Argyriou, Evgeniou and Pontil, 
2008). 

Several authors have considered the problem of 
variable selection based on the criterion 

T p / T \ V 2 

^£>--"w.G+*i:(£# • 

t=i j=i \t=i J 

This is a special case of the general group LASSO 
criterion. Here the groups are formed by the co- 



efficients corresponding to the jth variable across 
the regressions. The assumption here is that the jth 
variable plays a similar role across the tasks and 
should be selected or dropped at the same time. Sev- 
eral authors have studied the selection, estimation 
and prediction properties of the group LASSO in 
this model; see, for example, Bach (2008), Lounici 
et al. (2009), Lounici et al. (2011) and Obozinski, 
Wainwright and Jordan (2011) under various con- 
ditions on the design matrices and other regularity 
conditions. 
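
The grouping used in this setting can be written down explicitly. The sketch below stacks the $T$ regressions into one block-diagonal design and labels every column with the index $j$ of its variable, so that group $j$ collects the coefficients $\beta_{1j},\ldots,\beta_{Tj}$. The function names and the column layout (task-major ordering) are assumptions made for illustration.

```python
import numpy as np

def stack_tasks(X_list):
    """Block-diagonal design; coefficient (t, j) sits in column t * p + j."""
    T = len(X_list)
    n, p = X_list[0].shape
    X_big = np.zeros((T * n, T * p))
    for t, X_t in enumerate(X_list):
        X_big[t * n:(t + 1) * n, t * p:(t + 1) * p] = X_t
    group = np.tile(np.arange(p), T)   # variable index j for every column
    return X_big, group

def multitask_group_penalty(B):
    """sum_j (sum_t beta_tj^2)^(1/2) for a T x p coefficient matrix B."""
    return float(np.sum(np.sqrt(np.sum(B ** 2, axis=0))))

rng = np.random.default_rng(0)
X_list = [rng.standard_normal((5, 2)) for _ in range(3)]   # T = 3 tasks
X_big, group = stack_tasks(X_list)
B = np.array([[3.0, 0.0], [4.0, 0.0], [0.0, 0.0]])
penalty = multitask_group_penalty(B)
```

With this layout, a generic group LASSO routine applied to `X_big` with labels `group` selects or drops each variable across all tasks at once.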

5.5 Analysis of Genomic Data 

Group selection methods have important appli- 
cations in the analysis of high throughput genomic 
data — for example, to find genes and genetic path- 
ways that affect a clinical phenotype such as disease 
status or survival using gene expression data. Most 
phenotypes are the result of alterations in a lim- 
ited number of pathways, and there is coordination 
among the genes in these pathways. The genes in the 
same pathway or functional group can be treated 
as a group. Efficiency may be improved upon by 
incorporating pathway information into the analy- 
sis, thereby selecting pathways and genes simulta- 
neously. Another example is integrative analysis of 
multiple genomic datasets. In gene profiling studies, 
markers identified from analysis of single datasets of- 
ten suffer from a lack of reproducibility. Among the 
many possible causes, the most important one is per- 
haps the relatively small sample sizes and hence lack 
of power of individual studies. A cost-effective rem- 
edy to the small sample size problem is to pool and 
analyze data from multiple studies of the same dis- 
ease. A generalized seemingly unrelated regressions 
model can be used in this context, where a group 
structure arises naturally for the multiple measure- 
ments for the same gene across the studies. Some ex- 
amples of using group selection methods in these ap- 
plications include Wei and Li (2007), Jacob, Obozin- 
ski and Vert (2009), Ma and Huang (2009), Ma, 
Huang and Moran (2009), Ma, Huang and Song 
(2010), Ma et al. (2011), Pan, Xie and Shen (2010) 
and Peng et al. (2010), among others. 

5.6 Genome Wide Association Studies 

Genome wide association studies (GWAS) are an
important method for identifying disease susceptibility
genes for common and complex diseases. GWAS
involve scanning hundreds to thousands of samples,
often as case-control samples, utilizing hundreds of
thousands of single nucleotide polymorphism (SNP)
markers located throughout the human genome. The
SNPs from the same gene can be naturally considered
as a group. It is more powerful to select both
SNPs and genes simultaneously than to select them 
separately. Applications of group selection methods 
to genetic association analysis are discussed in Bre- 
heny and Huang (2009) and Zhou et al. (2010). 

6. DISCUSSION 

In this article, we provide a selective review of sev- 
eral group selection and bi-level selection methods. 
While considerable progress has been made in this 
area, much work remains to be done on a number 
of important issues. Here we highlight some of them 
that require further study in order to better apply 
these methods in practice. 

6.1 Penalty Parameter Selection 

In any penalization approach for variable selec- 
tion, a difficult question is how to determine the 
penalty parameters. This question is even more difficult
in group selection methods. Widely used criteria,
including the AIC (Akaike, 1973) and BIC
(Schwarz, 1978), require the estimation of the er- 
ror variance and degrees of freedom. For the group 
LASSO, Yuan and Lin (2006) proposed an estimate 
of the degrees of freedom, but it involves the least 
squares estimator of the coefficients, which is not 
feasible in $p \gg n$ models. The problem of variance
estimation has not been studied systematically in 
group selection models. It is possible to use K-fold
cross validation, which does not require estimating 
the variance or the degrees of freedom. However, to 
our knowledge, there have been no rigorous analyses
of this procedure in group selection settings. Recently,
Meinshausen and Bühlmann (2010) proposed
stability selection for choosing penalty parameters 
based on resampling. This is a general approach 
and is applicable to the group selection methods dis- 
cussed here. Furthermore, it does not require esti- 
mating the variance or the degrees of freedom. It 
would be interesting to apply this method to group 
selection and compare it with the existing methods 
in group selection problems. 

6.2 Theoretical Properties 

Currently, most theoretical results concerning se- 
lection, estimation and prediction on group selection 
methods in $p \gg n$ settings are derived for the group
LASSO in the context of linear regression. These 
results provide important insights into the behavior 



of the group LASSO. However, they are obtained 
for a given range of the penalty parameter. It is 
not clear whether, if the penalty parameter is se- 
lected using a data-driven procedure, such as cross 
validation, these results still hold. It is clearly of 
practical interest to confirm the estimation and pre- 
diction properties of group LASSO if the penalty 
parameter is selected using such a procedure. For 
concave selection methods, we considered the selec- 
tion property of the global 2-norm group MCP so- 
lutions. Although global results shed some light on 
the properties of these methods, it is more relevant 
to investigate the properties of the local solutions, 
such as those obtained based on the group coor- 
dinate descent algorithm. Therefore, much work is 
needed to understand the theoretical properties of 
various concave group selection methods and com- 
pare their performance with the group LASSO. 

6.3 Overlapping Groups 

In this article, we only considered the case where 
there is no overlapping among the groups. However, 
in many applied problems, overlapped groups arise 
naturally. For example, in genomic data analysis in- 
volving genes and pathways, many important genes 
belong to multiple pathways. Jacob, Obozinski and 
Vert (2009) proposed an extended group LASSO 
method for selection with overlapping groups. With 
their method, it is possible to select one variable 
without selecting all the groups containing it. Perci- 
val (2011) studied the theoretical properties of the 
method of Jacob, Obozinski and Vert (2009). Liu 
and Ye (2010) proposed an algorithm for solving the 
overlapping group LASSO problem. Zhao, Rocha 
and Yu (2009) considered the problem of overlap- 
ping groups in the context of composite absolute 
penalties. The results of Huang et al. (2009) on the 
selection consistency of the 1-norm group bridge al- 
low overlapping among groups under the assump- 
tion that the extent of overlapping is not large. How- 
ever, in general, especially for concave group selec- 
tion methods, this question has not been addressed. 

APPENDIX 

Lemma A.1. Let $\chi^2_k$ be a random variable with 
chi-square distribution with $k$ degrees of freedom. 
For $t > 1$, $P(\chi^2_k > kt) \le h(t,k)$, where $h(t,k)$ is 
defined in (4.2). 

This lemma is a restatement of the exponential in- 
equality for chi-square distributions of Laurent and 
Massart (2000). 
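
The bound in Lemma A.1 can be checked numerically. The sketch below assumes that $h(t,k)$ has the standard Chernoff form $\exp\{-k(t - 1 - \log t)/2\}$; the definition in (4.2) is not reproduced in this appendix, so this form is an assumption. The exact tail is computed only for even $k$, where the chi-square survival function has a closed form.

```python
import math


def h(t, k):
    """Assumed form of h(t, k): the Chernoff bound exp(-k (t - 1 - log t) / 2)."""
    return math.exp(-k * (t - 1.0 - math.log(t)) / 2.0)


def chi2_sf_even(x, k):
    """Exact P(chi2_k > x) for even k, using
    P(chi2_{2m} > x) = exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Compare the exact tail P(chi2_k > k t) with the bound h(t, k).
for k in (2, 4, 10):
    for t in (1.5, 2.0, 4.0):
        tail = chi2_sf_even(k * t, k)
        print(k, t, tail, h(t, k), tail <= h(t, k))
```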
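Proof of Theorem 4.1. Since $\hat\beta^o$ is the oracle least squares estimator, we have $\hat\beta^o_j = 0$ for $j \notin S$ and

\[ X_j'(y - X\hat\beta^o)/n = 0 \quad \forall j \in S. \tag{A.1} \]

If $\|\hat\beta^o_j\|_2/\sqrt{d_j} \ge \gamma\lambda$, then by the definition of the MCP, $\dot\rho(\|\hat\beta^o_j\|_2; \sqrt{d_j}\lambda, \gamma) = 0$. Since $c_{\min} > 1/\gamma$, the criterion (2.3) is strictly convex. By the KKT conditions, the equality $\hat\beta(\lambda,\gamma) = \hat\beta^o$ holds in the intersection of the events

\[ \Omega_1(\lambda) = \Bigl\{\max_{j \notin S} \|n^{-1}X_j'(y - X\hat\beta^o)\|_2/\sqrt{d_j} \le \lambda\Bigr\}, \qquad \Omega_2(\lambda) = \Bigl\{\min_{j \in S} \|\hat\beta^o_j\|_2/\sqrt{d_j} \ge \gamma\lambda\Bigr\}. \tag{A.2} \]

We first bound $1 - P(\Omega_1(\lambda))$. Let $\hat\beta^o_S = (\hat\beta^{o\prime}_j, j \in S)'$. By (A.1) and using $y = X_S\beta^o_S + \varepsilon$,

\[ \hat\beta^o_S = \Sigma_S^{-1}X_S'y/n = \beta^o_S + \Sigma_S^{-1}X_S'\varepsilon/n. \tag{A.3} \]

It follows that $n^{-1}X_j'(y - X\hat\beta^o) = n^{-1}X_j'(I_n - P_S)\varepsilon$, where $P_S = n^{-1}X_S\Sigma_S^{-1}X_S'$. Because $X_j'X_j/n = I_{d_j}$, $\|n^{-1/2}X_j'(I_n - P_S)\varepsilon\|_2^2/\sigma^2$ is distributed as a $\chi^2$ distribution with $d_j$ degrees of freedom. We have, for $n\lambda^2/\sigma^2 > 1$,

\[
\begin{aligned}
1 - P(\Omega_1(\lambda)) &= P\Bigl(\max_{j \notin S} \|n^{-1/2}X_j'(I_n - P_S)\varepsilon\|_2^2/(d_j\sigma^2) > n\lambda^2/\sigma^2\Bigr) \\
&\le \sum_{j \notin S} P\bigl(\|n^{-1/2}X_j'(I_n - P_S)\varepsilon\|_2^2/\sigma^2 > d_j n\lambda^2/\sigma^2\bigr) \\
&\le \sum_{j \notin S} h(n\lambda^2/\sigma^2, d_j) \\
&\le (J - |S|)\,h(n\lambda^2/\sigma^2, d_{\min}(S^c)) = \eta_{1n}(\lambda),
\end{aligned}
\tag{A.4}
\]

where we used Lemma A.1 in the third line.

Now consider $\Omega_2$. Recall $\beta^o_* = \min_{j \in S}\|\beta^o_j\|_2/\sqrt{d_j}$. If $\|\hat\beta^o_j - \beta^o_j\|_2/\sqrt{d_j} \le \beta^o_* - \gamma\lambda$ for all $j \in S$, then $\min_{j \in S}\|\hat\beta^o_j\|_2/\sqrt{d_j} \ge \gamma\lambda$. This implies

\[ 1 - P(\Omega_2(\lambda)) \le P\Bigl(\max_{j \in S}\|\hat\beta^o_j - \beta^o_j\|_2/\sqrt{d_j} > \beta^o_* - \gamma\lambda\Bigr). \]

Let $A_j$ be a $d_j \times d_S$ matrix with a $d_j \times d_j$ identity matrix $I_{d_j}$ in the $j$th block and 0's elsewhere. Then $n^{1/2}(\hat\beta^o_j - \beta^o_j) = n^{-1/2}A_j\Sigma_S^{-1}X_S'\varepsilon$. Note that

\[
\|n^{-1/2}A_j\Sigma_S^{-1}X_S'\varepsilon\|_2 \le \|A_j\|_2\,\|\Sigma_S^{-1/2}\|_2\,\|n^{-1/2}\Sigma_S^{-1/2}X_S'\varepsilon\|_2 \le c_1^{-1/2}\|n^{-1/2}\Sigma_S^{-1/2}X_S'\varepsilon\|_2
\]

and $\|n^{-1/2}\Sigma_S^{-1/2}X_S'\varepsilon\|_2^2/\sigma^2$ is distributed as a $\chi^2$ distribution with $|S|$ degrees of freedom. Therefore, similar to (A.4), we have, for $c_1 n(\beta^o_* - \gamma\lambda)^2/\sigma^2 > 1$,

\[
\begin{aligned}
1 - P(\Omega_2(\lambda)) &\le P\Bigl(\max_{j \in S} n^{-1/2}\|A_j\Sigma_S^{-1}X_S'\varepsilon\|_2/\sqrt{d_j} > \sqrt{n}(\beta^o_* - \gamma\lambda)\Bigr) \\
&\le P\Bigl(\max_{j \in S}\|n^{-1/2}\Sigma_S^{-1/2}X_S'\varepsilon\|_2^2/(d_j\sigma^2) > c_1 n(\beta^o_* - \gamma\lambda)^2/\sigma^2\Bigr) \\
&\le |S|\,h(c_1 n(\beta^o_* - \gamma\lambda)^2/\sigma^2, d_{\min}(S)) = \eta_{2n}(\lambda).
\end{aligned}
\tag{A.5}
\]

Combining (A.4) and (A.5), we have

\[ P(\hat\beta(\lambda,\gamma) \ne \hat\beta^o) \le 1 - P(\Omega_1(\lambda)) + 1 - P(\Omega_2(\lambda)) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda). \]

This completes the proof. □

For any $B \subset \{1, \ldots, J\}$ and $m \ge 1$, define

\[ \zeta(v; m, B) = \max\Bigl\{\frac{\|(P_A - P_B)v\|_2}{(mn)^{1/2}} : B \subseteq A \subseteq \{1, \ldots, J\},\ d_A = m + d_B\Bigr\} \tag{A.6} \]

for $v \in \mathbb{R}^n$, where $P_A = X_A(X_A'X_A)^{-1}X_A'$ is the orthogonal projection from $\mathbb{R}^n$ to the span of $X_A$.

Lemma A.2. Suppose $\xi n\lambda^2 \ge \sigma^2 d_{\max}$. We have

\[ P\bigl(2\sqrt{c^* d_s}\,\zeta(y; m, S) > \lambda\bigr) \le (J - |S|)^m \frac{e^m}{m^m}\exp(-m\xi n\lambda^2/16). \]

Proof. For any $A \supseteq S$, we have $(P_A - P_S)X_S\beta^o_S = 0$. Thus

\[ (P_A - P_S)y = (P_A - P_S)(X_S\beta^o_S + \varepsilon) = (P_A - P_S)\varepsilon. \]

Therefore,

\[ P\bigl(2\sqrt{c^* d_s}\,\zeta(y; m, S) > \lambda\bigr) = P\Bigl(\max_{A \supseteq S,\ |A| - |S| = m}\|(P_A - P_S)\varepsilon\|_2^2/\sigma^2 > \xi mn\lambda^2\Bigr). \]

Since $P_A - P_S$ is a projection matrix, $\|(P_A - P_S)\varepsilon\|_2^2/\sigma^2 \sim \chi^2_{m_A}$, where $m_A = \sum_{j \in A - S} d_j \le m d_{\max}$. Since there are $\binom{J - |S|}{m}$ ways to choose $A$ from $\{1, \ldots, J\}$, we have

\[ P\bigl(2\sqrt{c^* d_s}\,\zeta(y; m, S) > \lambda\bigr) \le \binom{J - |S|}{m} P\bigl(\chi^2_{m d_{\max}} > \xi mn\lambda^2\bigr). \]

This and Lemma A.1 imply that

\[
\begin{aligned}
P\bigl(2\sqrt{c^* d_s}\,\zeta(y; m, S) > \lambda\bigr) &\le \binom{J - |S|}{m} h(\xi n\lambda^2/d_{\max}, m d_{\max}) \\
&\le (J - |S|)^m \frac{e^m}{m^m}\,h(\xi n\lambda^2/d_{\max}, m d_{\max}).
\end{aligned}
\]

Here we used the inequality $\binom{J - |S|}{m} \le e^m(J - |S|)^m/m^m$. This completes the proof. □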
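
The counting step in the proof of Lemma A.2 relies on the elementary bound $\binom{N}{m} \le (eN/m)^m$, with $N$ playing the role of $J - |S|$. It follows from $\binom{N}{m} \le N^m/m!$ and $m! \ge (m/e)^m$. The short Python check below is only an illustration; the function name and the test values are arbitrary choices, not quantities from the article.

```python
import math


def binom_upper(N, m):
    """Upper bound (e N / m)^m on the binomial coefficient C(N, m)."""
    return (math.e * N / m) ** m


# Compare the exact count of size-m subsets with the bound for a few (N, m).
for N, m in [(10, 3), (50, 5), (200, 20)]:
    exact = math.comb(N, m)
    bound = binom_upper(N, m)
    print(N, m, exact, bound, exact <= bound)
```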

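Define $T$ as any set that satisfies 

\[ S \cup \{j : \|\hat\beta_j\|_2 \ne 0\} \subseteq T \subseteq S \cup \{j : n^{-1}X_j'(y - X\hat\beta) = \dot\rho(\|\hat\beta_j\|_2; \ldots)\,\ldots\}. \]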

Lemma A.3. Suppose that $X$ satisfies the 
SRC$(d^*, c_*, c^*)$, $d^* \ge (K_* + 1)|S|d_s$, and $\gamma \ge c_*^{-1}\sqrt{4 + c_*/c^*}$. 
Let $m_* = K_*|S|$. Then for any $y \in \mathbb{R}^n$ 
with $\lambda \ge 2\sqrt{c^* d_s}\,\zeta(y; m_*, S)$, we have 

\[ |T| \le (K_* + 1)|S|. \]

Proof. This lemma can be proved along the 
lines of the proof of Lemma 1 of Zhang (2010a) and 
is omitted. □ 

Proof of Theorem 4.2. By Lemma A.3, in 
the event 

\[ 2\sqrt{c^* d_{\max}(S)}\,\zeta(y; m_*, S) \le \lambda, \tag{A.7} \]

we have $|T| \le (K_* + 1)|S|$. Thus in event (A.7), the 
original model with $J$ groups reduces to a model 
with at most $(K_* + 1)|S|$ groups. In this reduced 
model, the conditions of Theorem 4.2 imply that 
the conditions of Theorem 4.1 are satisfied. By 
Lemma A.2, 

\[ P\bigl(2\sqrt{c^* d_{\max}(S)}\,\zeta(y; m_*, S) > \lambda\bigr) \le \eta_{3n}(\lambda). \tag{A.8} \]

Therefore, combining (A.8) and Theorem 4.1, we 
have 

\[ P(\hat\beta(\lambda,\gamma) \ne \hat\beta^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda) + \eta_{3n}(\lambda). \]

This proves Theorem 4.2. □ 